Abstract

High-dimensional inference refers to problems of statistical estimation in which the 
ambient dimension of the data may be comparable to or possibly even larger than the 
sample size. We study an instance of high-dimensional inference in which the goal is to 
estimate a matrix $\Theta^* \in \mathbb{R}^{k \times p}$ on the basis of N noisy observations, and the unknown
matrix $\Theta^*$ is assumed to be either exactly low rank, or "near" low-rank, meaning that it
can be well-approximated by a matrix with low rank. We consider an M-estimator based 
on regularization by the trace or nuclear norm over matrices, and analyze its performance 
under high-dimensional scaling. We provide non-asymptotic bounds on the Frobenius 
norm error that hold for a general class of noisy observation models, and then illustrate 
their consequences for a number of specific matrix models, including low-rank multivari- 
ate or multi-task regression, system identification in vector autoregressive processes, and 
recovery of low-rank matrices from random projections. Simulation results show excellent 
agreement with the high-dimensional scaling of the error predicted by our theory. 

1 Introduction 

High-dimensional inference refers to instances of statistical estimation in which the ambient 
dimension of the data is comparable to (or possibly larger than) the sample size. Problems 
with a high-dimensional character arise in a variety of applications in science and engineering, 
including analysis of gene array data, medical imaging, remote sensing, and astronomical 
data analysis. In settings where the number of parameters may be large relative to the 
sample size, the utility of classical "fixed p" results is questionable, and accordingly, a line of 
on-going statistical research seeks to obtain results that hold under high-dimensional scaling, 
meaning that both the problem size and sample size (as well as other problem parameters) 
may tend to infinity simultaneously. It is usually impossible to obtain consistent procedures 
in such settings without imposing some sort of additional constraints. Accordingly, there are 
now various lines of work on high-dimensional inference based on imposing different types of 
structural constraints. A substantial body of past work has focused on models with sparsity 
constraints, including the problem of sparse linear regression [HI HU H3 [Ml E], banded or 
sparse covariance matrices [8j [18] , sparse inverse covariance matrices [Ml HH H21 HQ] , sparse 
eigenstructure [271 El EHJ , and sparse regression matrices [371 EQl SHI [25] . A theme common 
to much of this work is the use of the $\ell_1$-penalty as a surrogate function to enforce the sparsity
constraint. 
In this paper, we focus on the problem of high-dimensional inference in the setting of
matrix estimation. As mentioned above, there is already a substantial body of work on the 
problem of sparse matrix recovery. In contrast, our interest in this paper is the problem of 
estimating a matrix $\Theta^* \in \mathbb{R}^{k \times p}$ that is either exactly low rank, meaning that it has at most
$r \ll \min\{k,p\}$ non-zero singular values, or more generally is near low-rank, meaning that it
can be well-approximated by a matrix of low rank. As we discuss at more length in the sequel, 
such exact or approximate low-rank conditions are appropriate for many applications, includ- 
ing multivariate or multi-task forms of regression, system identification for autoregressive 
processes, collaborative filtering, and matrix recovery from random projections. Analogous 
to the use of an $\ell_1$-regularizer for enforcing sparsity, we consider the use of the nuclear norm
(also known as the trace norm) for enforcing a rank constraint in the matrix setting. By 
definition, the nuclear norm is the sum of the singular values of a matrix, and so encourages 
sparsity in the vector of singular values, or equivalently for the matrix to be low-rank. The 
problem of low-rank matrix approximation and the use of nuclear norm regularization have 
been studied by various researchers. In her Ph.D. thesis, Fazel [19] discusses the use of the nuclear
norm as a heuristic for restricting the rank of a matrix, showing that in practice it is often
able to yield low-rank solutions. Other researchers have provided theoretical guarantees on 
the performance of nuclear norm and related methods for low-rank matrix approximation. 
Srebro et al. [43] proposed nuclear norm regularization for the collaborative filtering problem, 
and established risk consistency under certain settings. Recht et al. [41] provided sufficient
conditions for exact recovery using the nuclear norm heuristic when observing random pro- 
jections of a low-rank matrix, a set-up analogous to the compressed sensing model in sparse 
linear regression [17, 12]. Other researchers have studied a version of matrix completion in
which a subset of entries are revealed, and the goal is to obtain perfect reconstruction either 
via the nuclear norm heuristic [13] or by other SVD-based methods [28]. Finally, Bach [6] 
has provided results on the consistency of nuclear norm minimization for general observation 
models in noisy settings, but applicable to the classical "fixed p" setting. 

The goal of this paper is to analyze the nuclear norm relaxation for a general class of noisy 
observation models, and obtain non-asymptotic error bounds on the Frobenius norm that hold 
under high-dimensional scaling, and are applicable to both exactly and approximately low- 
rank matrices. We begin by presenting a generic observation model, and illustrating how it can 
be specialized to the several cases of interest, including low-rank multivariate regression, esti- 
mation of autoregressive processes, and random projection (compressed sensing) observations. 
In particular, this model is specified in terms of an operator $\mathfrak{X}$, which may be deterministic
or random depending on the setting, that maps any matrix $\Theta \in \mathbb{R}^{k \times p}$ to a vector of N noisy
observations. We then present a single main theorem (Theorem 1) followed by two corollaries
that cover the cases of exact low-rank constraints (Corollary 1) and near low-rank constraints
(Corollary 2) respectively. These results demonstrate that high-dimensional error rates are
controlled by two key quantities. First, the (random) observation operator $\mathfrak{X}$ is required to
satisfy a condition known as restricted strong convexity (RSC), which ensures that the loss
function has sufficient curvature to guarantee consistent recovery of the unknown matrix $\Theta^*$.
Second, our theory provides insight into the choice of regularization parameter that weights 
the nuclear norm, showing that an appropriate choice is to set it proportional to the spectral 
norm of a random matrix defined by the adjoint of observation operator X, and the observation 
noise in the problem. 

This initial set of results, though appealing in terms of their simple statements and general- 
ity, are somewhat abstractly formulated. Our next contribution is to show that by specializing 
our main result (Theorem 1) to three classes of models, we can obtain some concrete results
based on readily interpretable conditions. In particular, Corollary 3 deals with the case of
low-rank multivariate regression, relevant for applications in multitask learning. We show 
that the random operator X satisfies the RSC property for a broad class of observation mod- 
els, and we use random matrix theory to provide an appropriate choice of the regularization 
parameter. Our next result, Corollary 4, deals with the case of estimating the matrix of
parameters specifying a vector autoregressive (VAR) process [U [31] . Here we also establish 
that a suitable RSC property holds with high probability for the random operator X, and also 
specify a suitable choice of the regularization parameter. We note that the technical details 
here are considerably more subtle than the case of low-rank multivariate regression, due to 
dependencies introduced by the autoregressive sampling scheme. Accordingly, in addition 
to terms that involve the size, the matrix dimensions and rank, our bounds also depend on 
the mixing rate of the VAR process. Finally, we turn to the compressed sensing observation 
model for low-rank matrix recovery, as introduced by Recht et al. [41]. In this setting, we
again establish that the RSC property holds with high probability, specify a suitable choice 
of the regularization parameter, and thereby obtain a Frobenius error bound for noisy
observations (Corollary 5). A technical result that we prove en route, namely Proposition 1, is
of possible independent interest, since it provides a bound on the constrained norm of a
random Gaussian operator. In particular, this proposition allows us to obtain a sharp result
(Corollary 6) for the problem of recovering a low-rank matrix from perfectly observed random
projections, one that removes a logarithmic factor from past work [41].

The remainder of this paper is organized as follows. Section 2 is devoted to background
material, and the set-up of the problem. We present a generic observation model for
low-rank matrices, and then illustrate how it captures various cases of interest. We then define
the convex program based on nuclear norm regularization that we analyze in this paper. In
Section 3, we state our main theoretical results and discuss their consequences for different
model classes. Section 4 is devoted to the proofs of our results; in each case, we break down
the key steps in a series of lemmas, with more technical details deferred to the appendices.
In Section 5, we present the results of various simulations that illustrate excellent agreement
between the theoretical bounds and empirical behavior.

Notation: For the convenience of the reader, we collect standard pieces of notation here.
For a pair of matrices $\Theta$ and $\Gamma$ with commensurate dimensions, we let $\langle\!\langle \Theta, \Gamma \rangle\!\rangle = \operatorname{trace}(\Theta^T \Gamma)$
denote the trace inner product on matrix space. For a matrix $\Theta \in \mathbb{R}^{k \times p}$, we let $m = \min\{k,p\}$,
and denote its (ordered) singular values by $\sigma_1(\Theta) \ge \sigma_2(\Theta) \ge \ldots \ge \sigma_m(\Theta) \ge 0$. We also use the
notation $\sigma_{\max}(\Theta) = \sigma_1(\Theta)$ and $\sigma_{\min}(\Theta) = \sigma_m(\Theta)$ to refer to the maximal and minimal singular
values respectively. We use the notation $|||\cdot|||$ for various types of matrix norms based on these
singular values, including the nuclear norm $|||\Theta|||_1 = \sum_{j=1}^{m} \sigma_j(\Theta)$, the spectral or operator
norm $|||\Theta|||_{op} = \sigma_1(\Theta)$, and the Frobenius norm $|||\Theta|||_F = \sqrt{\operatorname{trace}(\Theta^T \Theta)} = \sqrt{\sum_{j=1}^{m} \sigma_j^2(\Theta)}$. We
refer the reader to Horn and Johnson [23, 24] for more background on these matrix norms
and their properties.
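As a concrete companion to the notation above, the following minimal sketch computes the three norms directly from the singular values and cross-checks them against NumPy's built-in matrix norms. The function name `matrix_norms` and the random test matrix are illustrative assumptions, not part of the paper.

```python
import numpy as np

def matrix_norms(theta):
    """Return (nuclear, operator, Frobenius) norms computed from singular values."""
    sigma = np.linalg.svd(theta, compute_uv=False)  # sorted in decreasing order
    nuclear = sigma.sum()                    # sum of singular values
    operator = sigma[0]                      # largest singular value
    frobenius = np.sqrt((sigma ** 2).sum())  # root of the sum of squares
    return nuclear, operator, frobenius

rng = np.random.default_rng(0)
theta = rng.standard_normal((4, 6))          # hypothetical 4 x 6 example
nuc, op, fro = matrix_norms(theta)

# Cross-check against NumPy's built-in matrix norms.
assert np.isclose(nuc, np.linalg.norm(theta, "nuc"))
assert np.isclose(op, np.linalg.norm(theta, 2))
assert np.isclose(fro, np.linalg.norm(theta, "fro"))
```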

2 Background and problem set-up 

We begin with some background on problems and applications in which rank constraints arise, 
before describing a generic observation model. We then introduce the semidefinite program 
(SDP) based on nuclear norm regularization that we study in this paper. 
2.1 Models with rank constraints

Imposing a rank r constraint on a matrix $\Theta^* \in \mathbb{R}^{k \times p}$ is equivalent to requiring that the rows (or
columns) of $\Theta^*$ lie in some r-dimensional subspace of $\mathbb{R}^p$ (or $\mathbb{R}^k$ respectively). Such types
of rank constraints (or approximate forms thereof) arise in a variety of applications, as we 
discuss here. In some sense, rank constraints are a generalization of sparsity constraints; 
rather than assuming that the data is sparse in a known basis, a rank constraint implicitly 
imposes sparsity but without assuming the basis. 

We first consider the problem of multivariate regression, also referred to as multi-task 
learning in statistical machine learning. The goal of multivariate regression is to estimate a 
prediction function that maps covariates $Z_j \in \mathbb{R}^p$ to multi-dimensional output vectors $Y_j \in \mathbb{R}^k$.
More specifically, let us consider the linear model, specified by a matrix $\Theta^* \in \mathbb{R}^{k \times p}$, of the
form

$$Y_a = \Theta^* Z_a + W_a, \quad \text{for } a = 1, \ldots, n, \tag{1}$$

where $\{W_a\}_{a=1}^{n}$ is an i.i.d. sequence of k-dimensional zero-mean noise vectors. Given a
collection of observations $\{Z_a, Y_a\}_{a=1}^{n}$ of covariate-output pairs, our goal is to estimate the unknown
matrix $\Theta^*$. This type of model has been used in many applications, including analysis of fMRI
image data [22], analysis of EEG data decoding [3], neural response modeling [11] and analysis 
of financial data. This model and closely related ones also arise in the problem of collaborative 
filtering [33], in which the goal is to predict users' preferences for items (such as movies or 
music) based on their and other users' ratings of related items. The papers [U [5] discuss ad- 
ditional instances of low-rank decompositions. In all of these settings, the low-rank condition 
translates into the existence of a smaller set of "features" that are actually controlling the 
prediction. 

As a second (not unrelated) example, we now consider the problem of system identification 
in vector autoregressive processes (see the book [31] for detailed background). A vector 
autoregressive (VAR) process in p dimensions is a stochastic process $\{Z_t\}_{t=1}^{\infty}$ specified by
an initialization $Z_1 \in \mathbb{R}^p$, followed by the recursion

$$Z_{t+1} = \Theta^* Z_t + W_t, \quad \text{for } t = 1, 2, 3, \ldots. \tag{2}$$

In this recursion, the sequence $\{W_t\}_{t=1}^{\infty}$ consists of i.i.d. samples of innovations noise. We
assume that each vector $W_t \in \mathbb{R}^p$ is zero-mean with covariance $\nu^2 I$, so that the process $\{Z_t\}_{t=1}^{\infty}$
is zero-mean, and has a covariance matrix $\Sigma$ given by the solution of the discrete-time Riccati
equation

$$\Sigma = \Theta^* \Sigma (\Theta^*)^T + \nu^2 I. \tag{3}$$

The goal of system identification in a VAR process is to estimate the unknown matrix
$\Theta^* \in \mathbb{R}^{p \times p}$ on the basis of a sequence of samples $\{Z_t\}_{t=1}^{n}$. In many application domains,
it is natural to expect that the system is controlled primarily by a low-dimensional subset 
of variables. For instance, models of financial data might have an ambient dimension p of 
thousands (including stocks, bonds, and other financial instruments), but the behavior of the 
market might be governed by a much smaller set of macro-variables (combinations of these 
financial instruments). Similar statements apply to other types of time series data, including 
neural data [11, 20], subspace tracking models in signal processing, and motion models
in computer vision. 
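To make the VAR set-up concrete, here is a minimal NumPy sketch that solves the stationary covariance equation (3) by vectorization and then simulates the recursion (2) from a stationary start. In this linear form the equation is often called a discrete-time Lyapunov equation. The function names, the dimensions, and the operator-norm value 0.8 are assumptions chosen for illustration only.

```python
import numpy as np

def stationary_covariance(theta, nu):
    """Solve Sigma = theta Sigma theta^T + nu^2 I (requires operator norm of theta < 1)."""
    p = theta.shape[0]
    # With row-major flattening, vec(theta S theta^T) = (theta kron theta) vec(S).
    lhs = np.eye(p * p) - np.kron(theta, theta)
    rhs = (nu ** 2) * np.eye(p).reshape(-1)
    sigma = np.linalg.solve(lhs, rhs).reshape(p, p)
    return (sigma + sigma.T) / 2             # remove round-off asymmetry

def simulate_var(theta, nu, n, rng):
    """Draw Z_1, ..., Z_n from the recursion Z_{t+1} = theta Z_t + W_t."""
    p = theta.shape[0]
    z = np.empty((n, p))
    z[0] = rng.multivariate_normal(np.zeros(p), stationary_covariance(theta, nu))
    for t in range(n - 1):
        z[t + 1] = theta @ z[t] + nu * rng.standard_normal(p)
    return z

rng = np.random.default_rng(1)
p, r, nu = 5, 2, 0.5
theta = rng.standard_normal((p, r)) @ rng.standard_normal((r, p))
theta *= 0.8 / np.linalg.norm(theta, 2)      # rank r, operator norm 0.8 (stable)

sigma = stationary_covariance(theta, nu)
assert np.allclose(sigma, theta @ sigma @ theta.T + nu ** 2 * np.eye(p))
z = simulate_var(theta, nu, n=200, rng=rng)
```

The Kronecker-product solve costs on the order of $p^6$ operations, so it is only sensible for small p; it is used here to keep the sketch dependency-free.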
A third example that we consider in this paper is a compressed sensing observation model,
in which one observes random projections of the unknown matrix $\Theta^*$. This observation model
has been studied extensively in the context of estimating sparse vectors [17, 12], and Recht et
al. [41] suggested and studied its extension to low-rank matrices. In their set-up, one observes
trace inner products of the form $\langle\!\langle X_i, \Theta^* \rangle\!\rangle = \operatorname{trace}(X_i^T \Theta^*)$, where $X_i \in \mathbb{R}^{k \times p}$ is a random
matrix (for instance, filled with standard normal N(0, 1) entries). Like compressed sensing
for sparse vectors, applications of this model include computationally efficient updating in
large databases (where the matrix $\Theta^*$ measures the difference between the database at two
different time instants), and matrix denoising.

2.2 A generic observation model 

We now introduce a generic observation model that will allow us to deal with these different
observation models in a unified manner. For pairs of matrices $A, B \in \mathbb{R}^{k \times p}$, recall the
Frobenius or trace inner product $\langle\!\langle A, B \rangle\!\rangle := \operatorname{trace}(B A^T)$. We then consider a linear observation
model of the form

$$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, N, \tag{4}$$

which is specified by the sequence of observation matrices $\{X_i\}_{i=1}^{N}$ and observation noise
$\{\varepsilon_i\}_{i=1}^{N}$. This observation model can be written in a more compact manner using
operator-theoretic notation. In particular, let us define the observation vector

$$y = [y_1 \; \cdots \; y_N]^T \in \mathbb{R}^N,$$

with a similar definition for $\varepsilon \in \mathbb{R}^N$ in terms of $\{\varepsilon_i\}_{i=1}^{N}$. We then use the observation matrices
$\{X_i\}_{i=1}^{N}$ to define an operator $\mathfrak{X} : \mathbb{R}^{k \times p} \to \mathbb{R}^N$ via $[\mathfrak{X}(\Theta)]_i = \langle\!\langle X_i, \Theta \rangle\!\rangle$. With this notation,
the observation model (4) can be re-written as

$$y = \mathfrak{X}(\Theta^*) + \varepsilon. \tag{5}$$
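The operator form of the observation model can be sketched in a few lines of NumPy. Here the observation matrices are stored as an array of shape (N, k, p); the Gaussian design, the noise level, and the name `apply_operator` are assumptions made for the example.

```python
import numpy as np

def apply_operator(xs, theta):
    """[X(theta)]_i = <<X_i, theta>> = trace(X_i^T theta), for xs of shape (N, k, p)."""
    return np.einsum("ikp,kp->i", xs, theta)

rng = np.random.default_rng(2)
N, k, p = 50, 3, 4
xs = rng.standard_normal((N, k, p))                    # observation matrices X_1..X_N
theta_star = rng.standard_normal((k, 1)) @ rng.standard_normal((1, p))  # rank one
eps = 0.1 * rng.standard_normal(N)                     # observation noise
y = apply_operator(xs, theta_star) + eps               # compact model y = X(theta*) + eps

# The same operator viewed as an N x (k p) matrix acting on the flattened matrix.
x_mat = xs.reshape(N, k * p)
assert np.allclose(apply_operator(xs, theta_star), x_mat @ theta_star.reshape(-1))
```

Viewing the operator as an ordinary N x kp matrix makes clear that the model is linear regression on the vectorized matrix; the low-rank structure enters only through the regularizer.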

Let us illustrate the form of the observation model (5) for some of the applications that
we considered earlier. 

Example 1 (Multivariate regression). Recall the observation model (1) for multivariate
regression. In this case, we make n observations of vector pairs $(Y_a, Z_a) \in \mathbb{R}^k \times \mathbb{R}^p$. Accounting
for the k-dimensional nature of the output, after the model is scalarized, we receive a total
of N = kn observations. Let us introduce the quantity b = 1, ..., k to index the different
elements of the output, and let $e_b \in \mathbb{R}^k$ denote the b-th canonical basis vector, so that we can write

$$Y_{ab} = \langle\!\langle e_b Z_a^T, \Theta^* \rangle\!\rangle + W_{ab}, \quad \text{for } b = 1, 2, \ldots, k. \tag{6}$$

By re-indexing this collection of N = nk observations via the mapping $(a, b) \mapsto i = a + (b - 1)n$,
we recognize multivariate regression as an instance of the observation model (4) with
observation matrix $X_i = e_b Z_a^T \in \mathbb{R}^{k \times p}$ and scalar observation $y_i = Y_{ab}$.
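The scalarization in Example 1 can be checked numerically. The sketch below (noise omitted, 0-based indices, hypothetical dimensions) builds the k x p observation matrix whose only non-zero row is row b, equal to $Z_a^T$, and confirms that its trace inner product with $\Theta^*$ reproduces coordinate b of $\Theta^* Z_a$. The flattening i = a + b·n is one convenient bijection onto {0, ..., nk − 1} and is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, p = 6, 3, 4
theta_star = rng.standard_normal((k, p))
Z = rng.standard_normal((n, p))      # row a holds the covariate Z_a
Y = Z @ theta_star.T                 # row a holds theta_star @ Z_a (noise omitted)

def observation_matrix(a, b):
    """k x p matrix e_b Z_a^T: zero except for row b, which equals Z_a."""
    x = np.zeros((k, p))
    x[b] = Z[a]
    return x

for a in range(n):
    for b in range(k):
        # <<X_i, theta*>> = sum of entrywise products = b-th coordinate of theta* Z_a.
        assert np.isclose(np.sum(observation_matrix(a, b) * theta_star), Y[a, b])

# The flattening (a, b) -> a + b * n hits every index in 0..n*k-1 exactly once.
flat = sorted(a + b * n for a in range(n) for b in range(k))
assert flat == list(range(n * k))
```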

Example 2 (Vector autoregressive processes). Recall that a vector autoregressive (VAR)
process is defined by the recursion (2), and suppose that we observe an n-sequence $\{Z_t\}_{t=1}^{n}$
produced by this recursion. Since each $Z_t = [Z_{t1} \; \ldots \; Z_{tp}]^T$ is p-variate, the scalarized
sample size is N = np. Letting b = 1, 2, ..., p index the dimension, we have

$$Z_{(t+1)b} = \langle\!\langle e_b Z_t^T, \Theta^* \rangle\!\rangle + W_{tb}. \tag{7}$$

In this case, we re-index the collection of N = np observations via the mapping $(t, b) \mapsto i = t + (b - 1)n$.
After doing so, we see that the autoregressive problem can be written in the
form (4) with $y_i = Z_{(t+1)b}$ and observation matrix $X_i = e_b Z_t^T$.



Example 3 (Compressed sensing). As mentioned earlier, this is a natural extension of the
compressed sensing observation model for sparse vectors to the case of low-rank matrices [41].
In particular, suppose that each observation matrix $X_i \in \mathbb{R}^{k \times p}$ has i.i.d. standard normal
N(0, 1) entries, so that we make observations of the form

$$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + \varepsilon_i, \quad \text{for } i = 1, 2, \ldots, N. \tag{8}$$

By construction, these observations are an instance of the model (4). In this case, the more
compact form (5) involves a random Gaussian operator mapping $\mathbb{R}^{k \times p}$ to $\mathbb{R}^N$, and we study
some of its properties in the sequel.

2.3 Regression with nuclear norm regularization 

We now consider an estimator that is naturally suited to the problems described in the 
previous section. Recall that the nuclear or trace norm of a matrix $\Theta \in \mathbb{R}^{k \times p}$ is given by
$|||\Theta|||_1 = \sum_{j=1}^{\min\{k,p\}} \sigma_j(\Theta)$, corresponding to the sum of its singular values. Given a collection
of observations $(y_i, X_i) \in \mathbb{R} \times \mathbb{R}^{k \times p}$, for i = 1, ..., N from the observation model (4), we
consider estimating the unknown $\Theta^*$ by solving the following optimization problem

$$\hat{\Theta} \in \arg\min_{\Theta \in \mathbb{R}^{k \times p}} \Big\{ \frac{1}{2N} \| y - \mathfrak{X}(\Theta) \|_2^2 + \lambda_N |||\Theta|||_1 \Big\}, \tag{9}$$

where $\lambda_N > 0$ is a regularization parameter. Note that the optimization problem (9) can be
viewed as the analog of the Lasso estimator [45], tailored to low-rank matrices as opposed to
sparse vectors. An important property of the optimization problem (9) is that it can be solved
in time polynomial in the sample size N and the matrix dimensions k and p. Indeed, the
optimization problem (9) is an instance of a semidefinite program [46], a class of convex
optimization problems that can be solved efficiently by various polynomial-time algorithms [10].
For instance, interior point methods are a classical method for solving semidefinite programs;
moreover, as we discuss in Section 5, there are a variety of other methods for solving the
semidefinite program (SDP) defining our M-estimator.

As in any typical M-estimator for statistical inference, the regularization parameter $\lambda_N$
is specified by the statistician. As part of the theoretical results in the next section, we provide
suitable choices of this parameter in order for the estimate $\hat{\Theta}$ to behave well, in the sense of
being close in Frobenius norm to the unknown matrix $\Theta^*$.
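For readers who want to experiment with the nuclear-norm-regularized least-squares estimator described above, the following is a minimal proximal gradient sketch that alternates a gradient step on the quadratic loss with soft-thresholding of the singular values. It is not the interior point approach mentioned in the text, nor necessarily one of the methods discussed in Section 5; the solver, the Gaussian design, the step-size rule, and the value `lam = 0.05` are all illustrative assumptions.

```python
import numpy as np

def svd_soft_threshold(m, tau):
    """Proximal map of tau * (nuclear norm): shrink every singular value by tau."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u * np.maximum(s - tau, 0.0)) @ vt

def nuclear_norm_regression(xs, y, lam, n_iter=500):
    """Proximal gradient on (1/(2N)) * ||y - X(theta)||_2^2 + lam * (nuclear norm of theta)."""
    N, k, p = xs.shape
    x_mat = xs.reshape(N, k * p)                 # operator as an N x (k p) matrix
    step = N / np.linalg.norm(x_mat, 2) ** 2     # 1 / Lipschitz constant of the gradient
    theta = np.zeros((k, p))
    for _ in range(n_iter):
        resid = x_mat @ theta.reshape(-1) - y
        grad = (x_mat.T @ resid).reshape(k, p) / N
        theta = svd_soft_threshold(theta - step * grad, step * lam)
    return theta

rng = np.random.default_rng(4)
N, k, p, r = 400, 8, 10, 2
theta_star = rng.standard_normal((k, r)) @ rng.standard_normal((r, p))
xs = rng.standard_normal((N, k, p))
y = np.einsum("ikp,kp->i", xs, theta_star) + 0.1 * rng.standard_normal(N)

lam = 0.05                                       # illustrative value, not a tuned choice
theta_hat = nuclear_norm_regression(xs, y, lam)
rel_err = np.linalg.norm(theta_hat - theta_star) / np.linalg.norm(theta_star)
print(f"relative Frobenius error: {rel_err:.3f}")
```

Each iteration costs one SVD of a k x p matrix, which is cheap for moderate dimensions; a general-purpose SDP solver would be the more direct route when high accuracy matters more than speed.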

3 Main results and some consequences 

In this section, we state our main results and discuss some of their consequences. Section 3.1
is devoted to results that apply to generic instances of low-rank problems, whereas Section 3.2
is devoted to the consequences of these results for more specific problem classes, including 
low-rank multivariate regression, estimation of vector autoregressive processes, and recovery 
of low-rank matrices from random projections. 




3.1 Results for general model classes

We begin by introducing the key technical condition that allows us to control the error $\hat{\Theta} - \Theta^*$
between an SDP solution $\hat{\Theta}$ and the unknown matrix $\Theta^*$. We refer to it as the restricted strong
convexity condition [35], since it amounts to guaranteeing that the quadratic loss function in
the SDP (9) is strictly convex over a restricted set of directions. Letting $\mathcal{C} \subseteq \mathbb{R}^{k \times p}$ denote the
restricted set of directions, we say that the operator $\mathfrak{X}$ satisfies restricted strong convexity
(RSC) over the set $\mathcal{C}$ if there exists some $\kappa(\mathfrak{X}) > 0$ such that

$$\frac{1}{2N} \| \mathfrak{X}(\Delta) \|_2^2 \ge \kappa(\mathfrak{X}) \, |||\Delta|||_F^2 \quad \text{for all } \Delta \in \mathcal{C}. \tag{10}$$

We note that analogous conditions have been used to establish error bounds in the context 
of sparse linear regression [21 [TS], in which case the set C corresponded to certain subsets of 
sparse vectors. 

Of course, the definition (10) hinges on the choice of the restricted set $\mathcal{C}$. In order to
specify some appropriate sets for the case of low-rank matrices, we require some additional
notation. For any matrix $\Theta \in \mathbb{R}^{k \times p}$, we let $\operatorname{row}(\Theta) \subseteq \mathbb{R}^p$ and $\operatorname{col}(\Theta) \subseteq \mathbb{R}^k$ denote its row space
and column space, respectively. For a given positive integer $r \le \min\{k,p\}$, any r-dimensional
subspace of $\mathbb{R}^k$ can be represented by some orthogonal matrix $U \in \mathbb{R}^{k \times r}$ (i.e., that satisfies
$U^T U = I_{r \times r}$). In a similar fashion, any r-dimensional subspace of $\mathbb{R}^p$ can be represented by
an orthogonal matrix $V \in \mathbb{R}^{p \times r}$. For any fixed pair of such matrices (U, V), we may define
the following two subspaces of $\mathbb{R}^{k \times p}$:

$$\mathcal{M}(U, V) := \{ \Theta \in \mathbb{R}^{k \times p} \mid \operatorname{row}(\Theta) \subseteq V \text{ and } \operatorname{col}(\Theta) \subseteq U \}, \quad \text{and} \tag{11a}$$
$$\mathcal{M}^{\perp}(U, V) := \{ \Theta \in \mathbb{R}^{k \times p} \mid \operatorname{row}(\Theta) \perp V \text{ and } \operatorname{col}(\Theta) \perp U \}. \tag{11b}$$

Finally, we let $\Pi_{\mathcal{M}(U,V)}$ and $\Pi_{\mathcal{M}^{\perp}(U,V)}$ denote the (respective) projection operators onto these
subspaces. When the subspaces (U, V) are clear from context, we use the shorthand notation
$\Delta'' = \Pi_{\mathcal{M}^{\perp}(U,V)}(\Delta)$ and $\Delta' = \Delta - \Delta''$. With this notation, we can define the restricted
sets $\mathcal{C}$ of interest. Using the singular value decomposition, we can write $\Theta^* = U D V^T$, where
$U \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{p \times p}$ are both orthogonal matrices, and $D \in \mathbb{R}^{k \times p}$ contains the singular
values of $\Theta^*$. For any positive integer $r \le \min\{k,p\}$, we let $(U^r, V^r)$ denote the subspace pair
defined by the top r left and right singular vectors of $\Theta^*$. For a given integer r and tolerance
$\delta > 0$, we then define a subset of matrices as follows:

$$\mathcal{C}(r; \delta) := \Big\{ \Delta \in \mathbb{R}^{k \times p} \;\Big|\; |||\Delta|||_F \ge \delta, \;\; |||\Delta''|||_1 \le 3 |||\Delta'|||_1 + 4 \, |||\Pi_{\mathcal{M}^{\perp}(U^r, V^r)}(\Theta^*)|||_1 \Big\}. \tag{12}$$
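The decomposition of a direction into a part in the orthogonal-complement subspace and a remainder, and the resulting restricted set, can be sketched numerically as follows. The projection onto matrices whose column space is orthogonal to U and whose row space is orthogonal to V is obtained by multiplying on both sides with the complementary projectors. The function names and test matrices are hypothetical, and the membership test simply transcribes the two inequalities of the restricted set.

```python
import numpy as np

def split_delta(delta, u_r, v_r):
    """Return (delta_prime, delta_dprime), with delta_dprime the projection onto M-perp."""
    k, p = delta.shape
    pu_perp = np.eye(k) - u_r @ u_r.T    # projector onto the complement of span(U^r)
    pv_perp = np.eye(p) - v_r @ v_r.T    # projector onto the complement of span(V^r)
    delta_dprime = pu_perp @ delta @ pv_perp
    return delta - delta_dprime, delta_dprime

def in_restricted_set(delta, theta_star, r, tol):
    """Check ||delta||_F >= tol and nuc(delta'') <= 3 nuc(delta') + 4 * (tail of theta_star)."""
    u, s, vt = np.linalg.svd(theta_star)
    d1, d2 = split_delta(delta, u[:, :r], vt[:r].T)
    tail = s[r:].sum()                   # nuclear norm of the M-perp part of theta_star
    nuc = lambda m: np.linalg.norm(m, "nuc")
    return bool(np.linalg.norm(delta) >= tol and nuc(d2) <= 3 * nuc(d1) + 4 * tail)

rng = np.random.default_rng(5)
k, p, r = 6, 7, 2
theta_star = rng.standard_normal((k, r)) @ rng.standard_normal((r, p))  # exactly rank r
u, s, vt = np.linalg.svd(theta_star)
u_r, v_r = u[:, :r], vt[:r].T

delta = rng.standard_normal((k, p))
d1, d2 = split_delta(delta, u_r, v_r)
assert np.isclose(np.sum(d1 * d2), 0.0)          # the two pieces are orthogonal
assert np.linalg.matrix_rank(d1) <= 2 * r        # delta' has rank at most 2r

inside = u_r @ rng.standard_normal((r, r)) @ v_r.T   # a direction inside M(U^r, V^r)
outside = d2                                         # a direction inside M-perp(U^r, V^r)
assert in_restricted_set(inside, theta_star, r, tol=1e-3)
assert not in_restricted_set(outside, theta_star, r, tol=1e-3)
```

The rank bound on the first piece is the reason the nuclear norm of that piece can be controlled by a multiple of $\sqrt{r}$ times its Frobenius norm.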

The next ingredient is the choice of the regularization parameter $\lambda_N$ used in solving the
SDP (9). Our theory specifies a choice for this quantity in terms of the adjoint of the operator
$\mathfrak{X}$, namely the operator $\mathfrak{X}^* : \mathbb{R}^N \to \mathbb{R}^{k \times p}$ defined by

$$\mathfrak{X}^*(\varepsilon) := \sum_{i=1}^{N} \varepsilon_i X_i. \tag{13}$$
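A short numerical check of the adjoint may be helpful. The sketch below verifies the defining identity of an adjoint, namely that the Euclidean inner product of the operator output with a vector equals the trace inner product of the matrix with the adjoint applied to that vector, and then evaluates the threshold $2|||\mathfrak{X}^*(\varepsilon)|||_{op}/N$ that appears in Theorem 1. The Gaussian design, the noise scale, and the function names are assumptions for illustration; in practice the noise vector is unobserved, so this quantity must be bounded rather than computed.

```python
import numpy as np

def apply_operator(xs, theta):
    """[X(theta)]_i = <<X_i, theta>> for xs of shape (N, k, p)."""
    return np.einsum("ikp,kp->i", xs, theta)

def apply_adjoint(xs, eps):
    """X*(eps) = sum_i eps_i X_i, a k x p matrix."""
    return np.einsum("i,ikp->kp", eps, xs)

rng = np.random.default_rng(7)
N, k, p = 200, 5, 6
xs = rng.standard_normal((N, k, p))
theta = rng.standard_normal((k, p))
eps = 0.5 * rng.standard_normal(N)

# Adjoint identity: <X(theta), eps> = <<theta, X*(eps)>>.
lhs = apply_operator(xs, theta) @ eps
rhs = np.sum(theta * apply_adjoint(xs, eps))
assert np.isclose(lhs, rhs)

# Threshold for the regularization parameter (uses the true noise, so oracle only).
lam_lower = 2 * np.linalg.norm(apply_adjoint(xs, eps), 2) / N
print(f"2 * opnorm(X*(eps)) / N = {lam_lower:.4f}")
```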

With this notation, we come to the first result of our paper. It is a deterministic result, which
specifies two conditions (an RSC condition and a choice of the regularizer) that
suffice to guarantee that any solution of the SDP (9) falls within a certain radius.
Theorem 1. Suppose that the operator $\mathfrak{X}$ satisfies restricted strong convexity with parameter
$\kappa(\mathfrak{X}) > 0$ over the set $\mathcal{C}(r; \delta)$, and that the regularization parameter $\lambda_N$ is chosen such that
$\lambda_N \ge 2 |||\mathfrak{X}^*(\varepsilon)|||_{op} / N$. Then any solution $\hat{\Theta}$ to the semidefinite program (9) satisfies

$$|||\hat{\Theta} - \Theta^*|||_F \le \max\Big\{ \delta, \; \frac{32 \lambda_N \sqrt{r}}{\kappa(\mathfrak{X})}, \; \Big[ \frac{16 \lambda_N \, |||\Pi_{\mathcal{M}^{\perp}(U^r, V^r)}(\Theta^*)|||_1}{\kappa(\mathfrak{X})} \Big]^{1/2} \Big\}. \tag{14}$$

Apart from the tolerance parameter $\delta$, the two main terms in the bound (14) have a
natural interpretation. The first term (involving $\sqrt{r}$) corresponds to estimation error, capturing
the difficulty of estimating a rank r matrix. The second is an approximation error, in which
the projection onto the set $\mathcal{M}^{\perp}(U^r, V^r)$ describes the gap between the true matrix $\Theta^*$ and
the rank r approximation.

Let us begin by illustrating the consequences of Theorem 1 when the true matrix $\Theta^*$ has
exactly rank r, in which case there is a very natural choice of the subspaces represented by
U and V. In particular, we form U from the r non-zero left singular vectors of $\Theta^*$, and
V from its r non-zero right singular vectors. Note that this choice of (U, V) ensures that
$\Pi_{\mathcal{M}^{\perp}(U,V)}(\Theta^*) = 0$. For technical reasons to be clarified, it suffices to set $\delta = 0$ in the case of
exact rank constraints, and we thus obtain the following result:

Corollary 1 (Exact low-rank recovery). Suppose that $\Theta^*$ has rank r, and $\mathfrak{X}$ satisfies RSC
with respect to $\mathcal{C}(r; 0)$. Then as long as $\lambda_N \ge 2 |||\mathfrak{X}^*(\varepsilon)|||_{op} / N$, any optimal solution $\hat{\Theta}$ to the
SDP (9) satisfies the bound

$$|||\hat{\Theta} - \Theta^*|||_F \le \frac{32 \sqrt{r} \, \lambda_N}{\kappa(\mathfrak{X})}. \tag{15}$$

Like Theorem 1, Corollary 1 is a deterministic statement on the SDP error. It takes a much
simpler form since when $\Theta^*$ is exactly low rank, neither the tolerance parameter $\delta$ nor the
approximation term is required.

As a more delicate example, suppose instead that $\Theta^*$ is nearly low-rank, an assumption
that we can formalize by requiring that its singular value sequence $\{\sigma_i(\Theta^*)\}_{i=1}^{\min\{k,p\}}$ decays
quickly enough. In particular, for a parameter $q \in [0, 1]$ and a positive radius $R_q$, we define
the set

$$\mathbb{B}_q(R_q) := \Big\{ \Theta \in \mathbb{R}^{k \times p} \;\Big|\; \sum_{i=1}^{\min\{k,p\}} |\sigma_i(\Theta)|^q \le R_q \Big\}. \tag{16}$$

Note that when q = 0, the set $\mathbb{B}_0(R_0)$ corresponds to the set of matrices with rank at most
$R_0$.
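To illustrate the near-low-rank sets just defined, the sketch below evaluates the sum of q-th powers of the singular values for an exactly low-rank matrix and for a full-rank matrix with polynomially decaying singular values. The function name, the numerical tolerance used to count non-zero singular values when q = 0, and the decay profile $\sigma_i = i^{-2}$ are illustrative assumptions.

```python
import numpy as np

def singular_value_lq(theta, q, tol=1e-10):
    """Sum of sigma_i(theta)^q; for q = 0, count singular values above tol (numerical rank)."""
    s = np.linalg.svd(theta, compute_uv=False)
    if q == 0:
        return float(np.sum(s > tol))
    return float(np.sum(s ** q))

rng = np.random.default_rng(6)

# Exactly rank-2 matrix: the q = 0 quantity equals the rank.
low_rank = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 7))
assert singular_value_lq(low_rank, q=0) == 2.0

# Full-rank 5 x 7 matrix with singular values sigma_i = i^{-2}, i = 1..5.
u, _ = np.linalg.qr(rng.standard_normal((5, 5)))
v, _ = np.linalg.qr(rng.standard_normal((7, 5)))
sig = np.arange(1, 6, dtype=float) ** -2.0
near_low_rank = (u * sig) @ v.T

print(singular_value_lq(near_low_rank, q=0))     # all five singular values are non-zero
print(singular_value_lq(near_low_rank, q=0.5))   # equals 1 + 1/2 + 1/3 + 1/4 + 1/5
```

The second matrix has full rank, so it lies in no small q = 0 ball, yet for q = 0.5 its radius is modest; this is the sense in which q > 0 captures matrices that are only approximately low rank.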

Corollary 2 (Near low-rank recovery). Suppose that $\Theta^* \in \mathbb{B}_q(R_q)$, the regularization
parameter is lower bounded as $\lambda_N \ge 2 |||\mathfrak{X}^*(\varepsilon)|||_{op} / N$, and the operator $\mathfrak{X}$ satisfies RSC with parameter
$\kappa(\mathfrak{X}) \in (0, 1]$ over the set $\mathcal{C}(R_q \lambda_N^{-q}; \delta)$. Then any solution $\hat{\Theta}$ to the SDP (9) satisfies

$$|||\hat{\Theta} - \Theta^*|||_F \le \max\Big\{ \delta, \; 32 \sqrt{R_q} \Big( \frac{\lambda_N}{\kappa(\mathfrak{X})} \Big)^{1 - q/2} \Big\}. \tag{17}$$

Note that the error bound (17) reduces to the exact low rank case (15) when q = 0 and $\delta = 0$.
The quantity $\lambda_N^{-q} R_q$ acts as the "effective rank" in this setting, as clarified by our proof in
Section 4.2. This particular choice is designed to provide an optimal trade-off between the
approximation and estimation error terms in Theorem 1. Since $\lambda_N$ is chosen to decay to zero
as the sample size N increases, this effective rank will increase, reflecting the fact that as we
obtain more samples, we can afford to estimate more of the smaller singular values of the
matrix $\Theta^*$.



3.2 Results for specific model classes 

As stated, Corollaries 1 and 2 are fairly abstract in nature. More importantly, it is not
immediately clear how the key underlying assumption, namely the RSC condition, can be
verified, since it is specified via subspaces that depend on $\Theta^*$, which is itself the unknown
quantity that we are trying to estimate. Nonetheless, we now show how, when specialized to
more concrete models, these results yield concrete and readily interpretable results. As will
be clear in the proofs of these results, each corollary requires overcoming two main technical
obstacles: establishing that the appropriate form of the RSC property holds in a uniform
sense (so that a priori knowledge of $\Theta^*$ is not required), and specifying an appropriate choice
of the regularization parameter $\lambda_N$. Each of these two steps is non-trivial, requiring some
random matrix theory, but the end results are simply stated upper bounds that hold with
high probability.

We begin with the case of rank-constrained multivariate regression. As discussed earlier
in Example 1, recall that we observe pairs $(Y_i, Z_i) \in \mathbb{R}^k \times \mathbb{R}^p$ linked by the linear model
$Y_i = \Theta^* Z_i + W_i$, where $W_i \sim N(0, \nu^2 I_{k \times k})$ is observation noise. Here we treat the case
of random design regression, meaning that the covariates $Z_i$ are modeled as random. In
particular, in the following result, we assume that $Z_i \sim N(0, \Sigma)$, i.i.d. for some p-dimensional
covariance matrix $\Sigma \succ 0$. Recalling that $\sigma_{\max}(\Sigma)$ and $\sigma_{\min}(\Sigma)$ denote the maximum and
minimum eigenvalues respectively, we have:

Corollary 3 (Low-rank multivariate regression). Consider the random design multivariate
regression model where $\Theta^* \in \mathbb{B}_q(R_q)$. There are universal constants $\{c_i, i = 1, 2, 3\}$ such that
if we solve the SDP (9) with regularization parameter $\lambda_N = \frac{10 \nu}{k} \sqrt{\sigma_{\max}(\Sigma)} \sqrt{\frac{k + p}{n}}$, we have

$$|||\hat{\Theta} - \Theta^*|||_F^2 \le c_1 \Big( \frac{\nu^2 \sigma_{\max}(\Sigma)}{\sigma_{\min}^2(\Sigma)} \Big)^{1 - q/2} R_q \Big( \frac{k + p}{n} \Big)^{1 - q/2} \tag{18}$$

with probability greater than $1 - c_2 \exp(-c_3 (k + p))$.

Remarks: Corollary 3 takes a particularly simple form when $\Sigma = I_{p \times p}$: then there exists a
constant $c_1'$ such that $|||\hat{\Theta} - \Theta^*|||_F^2 \le c_1' \nu^{2-q} R_q \big( \frac{k+p}{n} \big)^{1 - q/2}$. When $\Theta^*$ is exactly low rank (that
is, q = 0, and $\Theta^*$ has rank $r = R_0$), this simplifies even further to

$$|||\hat{\Theta} - \Theta^*|||_F^2 \le c_1' \, \frac{\nu^2 \, r (k + p)}{n}.$$

The scaling in this error bound is easily interpretable: naturally, the squared error is
proportional to the noise variance $\nu^2$, and the quantity r(k + p) counts the number of degrees of
freedom of a k x p matrix with rank r. Note that if we did not impose any constraints on
$\Theta^*$, then since a k x p matrix has a total of kp free parameters, we would expect at best to
obtain rates of the order $|||\hat{\Theta} - \Theta^*|||_F^2 = \Omega(\nu^2 k p / n)$. Note that when $\Theta^*$ is low rank (in
particular, when $r \ll \min\{k,p\}$), then the nuclear norm estimator achieves substantially faster
rates. Finally, we note that as stated, the result requires that min{k,p} tend to infinity
in order for the claim to hold with high probability. Although such high-dimensional
scaling is the primary focus of this paper, we note that for application to the classical setting
of fixed (k,p), the same statement (with different constants) holds with k + p replaced by log n.
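The degrees-of-freedom comparison in the remarks above can be made concrete with a small arithmetic sketch. It evaluates the two scalings, r(k + p)/n for the rank-constrained case and kp/n for the unconstrained case, with all universal constants set to one; the function names and the example dimensions are assumptions, and the numbers indicate orders of magnitude only.

```python
def low_rank_rate(r, k, p, n, nu=1.0):
    """Scaling nu^2 * r * (k + p) / n of the squared error for a rank-r matrix."""
    return nu ** 2 * r * (k + p) / n

def unconstrained_rate(k, p, n, nu=1.0):
    """Scaling nu^2 * k * p / n when all k * p entries are free parameters."""
    return nu ** 2 * k * p / n

k, p, r, n = 100, 100, 5, 2000          # hypothetical problem size
low = low_rank_rate(r, k, p, n)         # 5 * 200 / 2000
full = unconstrained_rate(k, p, n)      # 100 * 100 / 2000
print(low, full, full / low)
```

With these dimensions the rank-constrained scaling is smaller by a factor of kp / (r(k + p)) = 10, which is the gain the low-rank assumption buys.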



Next we turn to the case of estimating the system matrix $\Theta^*$ of an autoregressive (AR)
model, as discussed in Example 2.

Corollary 4 (Autoregressive models). Suppose that we are given n samples $\{Z_t\}_{t=1}^{n}$ from
a p-dimensional autoregressive process (2) that is stationary, based on a system matrix that
is stable ($|||\Theta^*|||_{op} \le \gamma < 1$) and approximately low-rank ($\Theta^* \in \mathbb{B}_q(R_q)$). Then there are
universal constants $\{c_i, i = 1, 2, 3\}$ such that if we solve the SDP (9) with regularization

parameter Xn = 8 °J^ op \/ f , then any solution G satisfies 

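

$$|||\hat{\Theta} - \Theta^*|||_F^2 \le c_1 \Big[ \frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}(\Sigma) (1 - \gamma)} \Big]^{2 - q} R_q \Big( \frac{p}{n} \Big)^{1 - q/2} \tag{19}$$

with probability greater than $1 - c_2 \exp(-c_3 p)$.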

Remarks: Like Corollary 3, the result as stated requires that p tend to infinity, but the same
bounds hold with p replaced by log n, yielding results suitable for classical (fixed dimension)
scaling. Second, the factor $(p/n)^{1 - q/2}$, like the analogous term¹ in Corollary 3, shows that
faster rates are obtained if $\Theta^*$ can be well-approximated by a low rank matrix, namely for
choices of the parameter $q \in [0, 1]$ that are closer to zero. Indeed, in the limit q = 0, we again
reduce to the case of an exact rank constraint $r = R_0$, and the corresponding squared error
scales as rp/n. In contrast to the case of multivariate regression, the error bound (19) also
depends on the upper bound $|||\Theta^*|||_{op} \le \gamma < 1$ on the operator norm of the system matrix
$\Theta^*$. Such dependence is to be expected since the quantity $\gamma$ controls the (in)stability and
mixing rate of the autoregressive process. As clarified in the proof, the dependence of the
sampling in the AR model also presents some technical challenges not present in the setting
of multivariate regression.



Finally, we turn to analysis of the compressed sensing model for matrix recovery, which was
discussed in Example 3. The following result applies to the setting in which the observation
matrices $\{X_i\}_{i=1}^N$ are drawn i.i.d., with standard $N(0, 1)$ elements. We assume that the
observation noise vector $\varepsilon \in \mathbb{R}^N$ satisfies the bound $\|\varepsilon\|_2 \le 2\nu\sqrt{N}$ for some constant $\nu$, an
assumption that holds for any bounded noise, and also holds with high probability for any
random noise vector with sub-Gaussian entries with parameter $\nu$ (one example being Gaussian
noise $N(0, \nu^2)$).
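The noise condition above is easy to probe numerically. The following is a minimal sketch, not part of the original analysis: it draws Gaussian noise and checks the bound $\|\varepsilon\|_2 \le 2\nu\sqrt{N}$ over repeated trials. The function name `noise_norm_ok` and the particular values of `N`, `nu` and the number of trials are illustrative assumptions.

```python
import math
import random

def noise_norm_ok(N, nu, rng):
    """Draw N i.i.d. N(0, nu^2) noise values and test ||eps||_2 <= 2 * nu * sqrt(N)."""
    eps = [rng.gauss(0.0, nu) for _ in range(N)]
    norm = math.sqrt(sum(e * e for e in eps))
    return norm <= 2.0 * nu * math.sqrt(N)

rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
trials = [noise_norm_ok(N=500, nu=1.5, rng=rng) for _ in range(200)]
print(all(trials))
```

Since $\|\varepsilon\|_2^2/\nu^2$ is a chi-square variable with $N$ degrees of freedom, its value concentrates near $N$, well below the threshold $4N$ implied by the bound.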

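Corollary 5 (Compressed sensing recovery). Suppose that $\Theta^* \in \mathbb{B}_q(R_q)$, and that the sample
size is lower bounded as $N > 200\, R_q^{1/(1-q/2)}\, kp$. Then any solution $\hat{\Theta}$ to the SDP (9) satisfies
the bound

$$|||\hat{\Theta} - \Theta^*|||_F^2 \;\le\; 256\, \nu^{2-q} R_q \left[\sqrt{\frac{k}{N}} + \sqrt{\frac{p}{N}}\right]^{2-q} \qquad (20)$$

with probability greater than $1 - c_1 \exp(-c_2(k + p))$.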



¹ The term in Corollary 3 has a factor $k + p$, since the matrix in that case could be non-square in general.

The central challenge in proving this result is in proving an appropriate form of the RSC
property. The following result on the random operator $\mathfrak{X}$ may be of independent interest here:



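Proposition 1. Under the stated conditions, the random operator $\mathfrak{X}$ satisfies

$$\frac{\|\mathfrak{X}(\Theta)\|_2}{\sqrt{N}} \;\ge\; \frac{1}{4}\,|||\Theta|||_F - \left(\sqrt{\frac{k}{N}} + \sqrt{\frac{p}{N}}\right)|||\Theta|||_1 \qquad \text{for all } \Theta \in \mathbb{R}^{k \times p} \qquad (21)$$

with probability at least $1 - 2\exp(-N/32)$.

The proof of this result, provided in Appendix D, makes use of the Gordon-Slepian inequalities
for Gaussian processes, and concentration of measure. As we show in Section 4.5, it implies
the form of the RSC property needed to establish Corollary 5.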
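As an informal sanity check, the lower bound of Proposition 1 can be evaluated on randomly drawn low-rank matrices. The sketch below assumes the bound takes the form $\|\mathfrak{X}(\Theta)\|_2/\sqrt{N} \ge \frac{1}{4}|||\Theta|||_F - (\sqrt{k/N} + \sqrt{p/N})|||\Theta|||_1$; the helper names and problem sizes are hypothetical. It only samples a handful of matrices, so it illustrates the inequality rather than verifying the uniform statement.

```python
import numpy as np

def gaussian_operator(N, k, p, rng):
    """Draw N observation matrices X_i in R^{k x p} with i.i.d. N(0, 1) entries."""
    return rng.standard_normal((N, k, p))

def apply_operator(Xs, Theta):
    """Return the vector with entries <<X_i, Theta>> = trace(X_i^T Theta)."""
    return np.tensordot(Xs, Theta, axes=([1, 2], [0, 1]))

def rsc_gap(Xs, Theta):
    """Left-hand side minus right-hand side of the assumed bound; >= 0 when it holds."""
    N, k, p = Xs.shape
    lhs = np.linalg.norm(apply_operator(Xs, Theta)) / np.sqrt(N)
    fro = np.linalg.norm(Theta, "fro")
    nuc = np.linalg.norm(Theta, "nuc")
    rhs = 0.25 * fro - (np.sqrt(k / N) + np.sqrt(p / N)) * nuc
    return lhs - rhs

rng = np.random.default_rng(0)
N, k, p, r = 400, 10, 12, 2          # illustrative sizes, not from the paper
Xs = gaussian_operator(N, k, p, rng)
gaps = []
for _ in range(20):
    Theta = rng.standard_normal((k, r)) @ rng.standard_normal((r, p))  # random rank-r matrix
    gaps.append(rsc_gap(Xs, Theta))
print(min(gaps) >= 0.0)
```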

Proposition 1 also implies an interesting property of the null space of the operator $\mathfrak{X}$;
one that can be used to establish a corollary about recovery of low-rank matrices when the
observations are noiseless. In particular, suppose that we are given the noiseless observations
$y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle$ for $i = 1, \ldots, N$, and that we try to recover the unknown matrix $\Theta^*$ by solving
the SDP

$$\min_{\Theta \in \mathbb{R}^{k \times p}} |||\Theta|||_1 \quad \text{such that } \langle\!\langle X_i, \Theta \rangle\!\rangle = y_i \text{ for all } i = 1, \ldots, N, \qquad (22)$$

a recovery procedure that was studied by Recht et al. [41]. Proposition 1 allows us to obtain
a sharp result on recovery using this method:

Corollary 6. Suppose that $\Theta^*$ has rank $r$, and that we are given $N > 40^2\, r(k + p)$ noiseless
samples. Then with probability at least $1 - 2\exp(-N/32)$, the SDP (22) recovers the matrix
$\Theta^*$ exactly.

This result removes some extra logarithmic factors that were included in the earlier work [41],
and provides the appropriate analog to compressed sensing results for sparse vectors [17, 12].
Note that (as in most of our results) we have made little effort to obtain good constants in
this result: the important property is that the sample size $N$ scales linearly in both $r$ and
$k + p$.

4 Proofs

We now turn to the proofs of Theorem 1 and Corollaries 2 through 6. In each case, we
provide the primary steps in the main text, with more technical details stated as lemmas and
proved in the Appendix.

4.1 Proof of Theorem 1

By the optimality of $\hat{\Theta}$ for the SDP (9), we have

$$\frac{1}{2N}\|y - \mathfrak{X}(\hat{\Theta})\|_2^2 + \lambda_N |||\hat{\Theta}|||_1 \;\le\; \frac{1}{2N}\|y - \mathfrak{X}(\Theta^*)\|_2^2 + \lambda_N |||\Theta^*|||_1.$$

Defining the error matrix $\Delta = \Theta^* - \hat{\Theta}$ and performing some algebra yields the inequality

$$\frac{1}{2N}\|\mathfrak{X}(\Delta)\|_2^2 \;\le\; -\frac{1}{N}\langle \varepsilon, \mathfrak{X}(\Delta) \rangle + \lambda_N \big\{ |||\hat{\Theta} + \Delta|||_1 - |||\hat{\Theta}|||_1 \big\}. \qquad (23)$$

By definition of the adjoint and Hölder's inequality, we have

$$\frac{1}{N}\big|\langle \varepsilon, \mathfrak{X}(\Delta) \rangle\big| = \frac{1}{N}\big|\langle\!\langle \mathfrak{X}^*(\varepsilon), \Delta \rangle\!\rangle\big| \;\le\; \frac{1}{N}\,|||\mathfrak{X}^*(\varepsilon)|||_{op}\,|||\Delta|||_1. \qquad (24)$$

By the triangle inequality, we have $|||\hat{\Theta} + \Delta|||_1 - |||\hat{\Theta}|||_1 \le |||\Delta|||_1$. Substituting this inequality
and the bound (24) into the inequality (23) yields

$$\frac{1}{2N}\|\mathfrak{X}(\Delta)\|_2^2 \;\le\; \frac{1}{N}\,|||\mathfrak{X}^*(\varepsilon)|||_{op}\,|||\Delta|||_1 + \lambda_N |||\Delta|||_1 \;\le\; 2\lambda_N |||\Delta|||_1,$$

where the second inequality makes use of our choice $\lambda_N \ge \frac{2}{N}|||\mathfrak{X}^*(\varepsilon)|||_{op}$.
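The adjoint identity and the Hölder-type bound behind (24) can be checked numerically. The sketch below is illustrative only: the function names `operator` and `adjoint`, the sizes, and the random inputs are assumptions, with the operator taken to be $\mathfrak{X}(\Theta)_i = \langle\!\langle X_i, \Theta \rangle\!\rangle$ and its adjoint $\mathfrak{X}^*(u) = \sum_i u_i X_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, p = 50, 6, 8                       # illustrative sizes
Xs = rng.standard_normal((N, k, p))      # observation matrices X_1, ..., X_N
eps = rng.standard_normal(N)             # noise vector
Delta = rng.standard_normal((k, p))      # an arbitrary error matrix

def operator(Xs, Theta):
    """X(Theta): vector of trace inner products <<X_i, Theta>>."""
    return np.einsum("ikp,kp->i", Xs, Theta)

def adjoint(Xs, u):
    """X^*(u) = sum_i u_i X_i, the adjoint of the observation operator."""
    return np.einsum("i,ikp->kp", u, Xs)

lhs = eps @ operator(Xs, Delta)                  # <eps, X(Delta)>
rhs = np.sum(adjoint(Xs, eps) * Delta)           # <<X^*(eps), Delta>>
op_norm = np.linalg.norm(adjoint(Xs, eps), 2)    # largest singular value
nuc_norm = np.linalg.norm(Delta, "nuc")          # sum of singular values

print(np.isclose(lhs, rhs))                # adjoint identity
print(abs(lhs) <= op_norm * nuc_norm)      # duality between operator and nuclear norms
```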

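It remains to lower bound the term on the left-hand side, while upper bounding the
quantity $|||\Delta|||_1$ on the right-hand side. The following technical result allows us to do so.
Recall our earlier definition (11) of the sets $\mathcal{M}$ and $\mathcal{M}^\perp$ associated with a given subspace
pair.

Lemma 1. Let $(U, V)$ represent a pair of $r$-dimensional subspaces of left and right singular
vectors of $\Theta^*$. Then there exists a matrix decomposition $\Delta = \Delta' + \Delta''$ of the error $\Delta$ such
that

(a) The matrix $\Delta'$ satisfies the constraint $\mathrm{rank}(\Delta') \le 2r$, and

(b) If $\lambda_N \ge 2|||\mathfrak{X}^*(\varepsilon)|||_{op}/N$, then the nuclear norm of $\Delta''$ is bounded as

$$|||\Delta''|||_1 \;\le\; 3|||\Delta'|||_1 + 4|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1. \qquad (25)$$

See Appendix A for the proof of this claim. Using Lemma 1, we can complete the proof
of the theorem. In particular, from the bound (25) and the RSC assumption, we find that

$$\frac{1}{2N}\|\mathfrak{X}(\Delta)\|_2^2 \;\ge\; \kappa(\mathfrak{X})\,|||\Delta|||_F^2.$$

Using the triangle inequality together with inequality (25), we obtain

$$|||\Delta|||_1 \;\le\; |||\Delta'|||_1 + |||\Delta''|||_1 \;\le\; 4|||\Delta'|||_1 + 4|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1.$$

From the rank constraint in Lemma 1(a), we have $|||\Delta'|||_1 \le \sqrt{2r}\,|||\Delta'|||_F$. Putting together the
pieces, we find

$$\kappa(\mathfrak{X})\,|||\Delta|||_F^2 \;\le\; \max\big\{ 32\lambda_N \sqrt{r}\,|||\Delta|||_F,\; 16\lambda_N |||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 \big\},$$

which implies that

$$|||\Delta|||_F \;\le\; \max\left\{ \frac{32\lambda_N \sqrt{r}}{\kappa(\mathfrak{X})},\; \left(\frac{16\lambda_N |||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1}{\kappa(\mathfrak{X})}\right)^{1/2} \right\},$$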

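as claimed.

4.2 Proof of Corollary 2

Let $m = \min\{k, p\}$. In this case, we consider the singular value decomposition $\Theta^* = U D V^T$,
where $U \in \mathbb{R}^{k \times m}$ and $V \in \mathbb{R}^{p \times m}$ are orthogonal, and we assume that $D$ is diagonal with the
singular values in non-increasing order $\sigma_1(\Theta^*) \ge \sigma_2(\Theta^*) \ge \ldots \ge \sigma_m(\Theta^*) \ge 0$. For a parameter
$\tau > 0$ to be chosen, we let $K = \{i \mid \sigma_i(\Theta^*) > \tau\}$, and let $U^K$ (respectively $V^K$) denote the
$k \times |K|$ (respectively the $p \times |K|$) orthogonal matrix consisting of the first $|K|$ columns of $U$
(respectively $V$). With this choice, the matrix $\Theta^*_{K^c} := \Pi_{\mathcal{M}^\perp(U^K, V^K)}(\Theta^*)$ has rank at most
$m - |K|$, with singular values $\{\sigma_i(\Theta^*), i \in K^c\}$. Moreover, since $\sigma_i(\Theta^*) \le \tau$ for all $i \in K^c$, we
have

$$|||\Theta^*_{K^c}|||_1 = \tau \sum_{i=|K|+1}^{m} \frac{\sigma_i(\Theta^*)}{\tau} \;\le\; \tau \sum_{i=|K|+1}^{m} \left(\frac{\sigma_i(\Theta^*)}{\tau}\right)^{q} \;\le\; \tau^{1-q} R_q.$$

On the other hand, we also have $R_q \ge \sum_{i=1}^{m} |\sigma_i(\Theta^*)|^q \ge |K|\,\tau^q$, which implies that
$|K| \le \tau^{-q} R_q$. From the general error bound with $r = |K|$, we obtain

$$|||\hat{\Theta} - \Theta^*|||_F \;\le\; \max\left\{ \frac{32\lambda_N \sqrt{R_q}\,\tau^{-q/2}}{\kappa(\mathfrak{X})},\; \left[\frac{16\lambda_N \tau^{1-q} R_q}{\kappa(\mathfrak{X})}\right]^{1/2} \right\}.$$

Setting $\tau = \lambda_N/\kappa(\mathfrak{X})$ yields that

$$|||\hat{\Theta} - \Theta^*|||_F \;\le\; \max\left\{ 32\left(\frac{\lambda_N}{\kappa(\mathfrak{X})}\right)^{1-q/2} \sqrt{R_q},\; \left[16\left(\frac{\lambda_N}{\kappa(\mathfrak{X})}\right)^{2-q} R_q\right]^{1/2} \right\} = 32\sqrt{R_q}\left(\frac{\lambda_N}{\kappa(\mathfrak{X})}\right)^{1-q/2},$$

as claimed.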
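The two counting facts used in this argument, namely $|K| \le \tau^{-q} R_q$ and a tail nuclear norm of at most $\tau^{1-q} R_q$, can be illustrated on a concrete spectrum. The sketch below uses a hypothetical polynomially decaying set of singular values; the function name `threshold_split` and the values of `q` and `tau` are illustrative assumptions.

```python
def threshold_split(sigmas, tau, q):
    """Split singular values at level tau and return the quantities used in the proof."""
    R_q = sum(s ** q for s in sigmas)             # l_q "radius" of the spectrum
    K = [s for s in sigmas if s > tau]            # singular values indexed by K
    tail = sum(s for s in sigmas if s <= tau)     # nuclear norm of the K^c part
    return R_q, len(K), tail

# Hypothetical spectrum with polynomial decay sigma_i = i^{-2}, i = 1, ..., 50.
sigmas = [i ** -2.0 for i in range(1, 51)]
q, tau = 0.5, 0.01
R_q, size_K, tail = threshold_split(sigmas, tau, q)

print(size_K <= tau ** (-q) * R_q)      # |K| <= tau^{-q} R_q
print(tail <= tau ** (1 - q) * R_q)     # tail sum <= tau^{1-q} R_q
```

Smaller values of `tau` enlarge the set $K$ (more estimation error) while shrinking the tail (less approximation error), which is the trade-off that the choice $\tau = \lambda_N/\kappa(\mathfrak{X})$ balances.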



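4.3 Proof of Corollary 3

For the proof of this corollary, we adopt the following notation. We first define the three
matrices

$$X \in \mathbb{R}^{n \times p}, \qquad Y \in \mathbb{R}^{n \times k}, \qquad \text{and} \qquad W \in \mathbb{R}^{n \times k}, \qquad (26)$$

formed by stacking the $n$ covariate vectors, response vectors, and noise vectors, respectively, as rows.

With this notation and using the relation $N = nk$, the SDP objective function (9) can be
written as $\frac{1}{k}\big\{\frac{1}{2n}|||Y - X\Theta|||_F^2 + \lambda_n |||\Theta|||_1\big\}$, where we have defined $\lambda_n = \lambda_N k$.

In order to establish the RSC property for this model, some algebra shows that we need
to establish a lower bound on the quantity

$$\frac{1}{2n}|||X\Delta|||_F^2 = \frac{1}{2n}\sum_{j=1}^{k} \|(X\Delta)_j\|_2^2 \;\ge\; \frac{\sigma_{\min}(X^T X)}{2n}\,|||\Delta|||_F^2,$$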

where $\sigma_{\min}$ denotes the minimum eigenvalue. The following lemma follows by adapting known
concentration results for random matrices (see the paper [47] for details):

Lemma 2. Let $X \in \mathbb{R}^{n \times p}$ be a random matrix with i.i.d. rows sampled from a $p$-variate
$N(0, \Sigma)$ distribution. Then for $n > p$, we have

$$P\left[\sigma_{\min}\Big(\frac{1}{n}X^T X\Big) \ge \frac{\sigma_{\min}(\Sigma)}{9}, \quad \sigma_{\max}\Big(\frac{1}{n}X^T X\Big) \le 9\,\sigma_{\max}(\Sigma)\right] \;\ge\; 1 - 4\exp(-n/2).$$

As a consequence, we have $\frac{\sigma_{\min}(X^T X)}{2n} \ge \frac{\sigma_{\min}(\Sigma)}{18}$ with probability at least $1 - 4\exp(-n/2)$ for all
$n > p$, which establishes that the RSC property holds with $\kappa(\mathfrak{X}) = \frac{1}{20}\sigma_{\min}(\Sigma)$.

Next we need to upper bound the quantity $|||\mathfrak{X}^*(\varepsilon)|||_{op}$ for this model, so as to verify that the
stated choice for $\lambda_N$ is valid. Following some algebra, we find that

$$\frac{1}{n}|||\mathfrak{X}^*(\varepsilon)|||_{op} = \frac{1}{n}|||X^T W|||_{op}.$$

The following lemma is proved in Appendix B:

Lemma 3. There are constants $c_i > 0$ such that

$$P\left[\frac{1}{n}|||X^T W|||_{op} \ge 5\nu\sqrt{|||\Sigma|||_{op}}\,\sqrt{\frac{k + p}{n}}\right] \;\le\; c_1 \exp(-c_2(k + p)). \qquad (27)$$

Using these two lemmas, we can complete the proof of Corollary 3. First, recalling the
scaling $N = kn$, we see that Lemma 3 implies that the choice $\lambda_n = 10\nu\sqrt{|||\Sigma|||_{op}}\sqrt{\frac{k+p}{n}}$ satisfies
the conditions of Corollary 2 with high probability. Lemma 2 shows that the RSC property
holds with $\kappa(\mathfrak{X}) = \sigma_{\min}(\Sigma)/20$, again with high probability. Consequently, Corollary 2 implies
that

$$|||\hat{\Theta} - \Theta^*|||_F^2 \;\le\; 32^2 R_q \left(10\nu\sqrt{|||\Sigma|||_{op}}\,\sqrt{\frac{k+p}{n}}\;\frac{20}{\sigma_{\min}(\Sigma)}\right)^{2-q} = c_1 \left(\frac{\nu^2\,|||\Sigma|||_{op}}{\sigma_{\min}^2(\Sigma)}\right)^{1-q/2} R_q \left(\frac{k+p}{n}\right)^{1-q/2}$$

with probability greater than $1 - c_2\exp(-c_3(k + p))$, as claimed.
4.4 Proof of Corollary 4

For the proof of this corollary, we adopt the notation

$$X \in \mathbb{R}^{n \times p} \qquad \text{and} \qquad Y \in \mathbb{R}^{n \times p},$$

whose rows are successive samples of the process, with each row of $Y$ one time step ahead of the corresponding row of $X$.

Finally, we let $W \in \mathbb{R}^{n \times p}$ be a matrix with i.i.d. $N(0, \nu^2)$ elements, corresponding to the
innovations noise driving the AR process. With this notation and using the relation $N = np$,
the SDP objective function (9) can be written as $\frac{1}{p}\big\{\frac{1}{2n}|||Y - X\Theta|||_F^2 + \lambda_n |||\Theta|||_1\big\}$, where we have
defined $\lambda_n = \lambda_N p$. At a high level, the proof of this corollary is similar to that of Corollary 3,
in that we use random matrix theory to establish the required RSC property, and to justify
the choice of $\lambda_n$, or equivalently $\lambda_N$. However, it is considerably more challenging, due to the
dependence in the rows of the random matrices, and the cross-dependence between the two
matrices $X$ and $W$ (which were independent in the setting of multivariate regression).



The following lemma provides the lower bound needed to establish RSC for the autoregressive
model:

Lemma 4. The eigenspectrum of the matrix $X^T X/n$ is well-controlled in terms of the
stationary covariance matrix: in particular, as long as $n > c_3 p$, we have

$$\text{(a)} \;\; \sigma_{\max}\Big(\frac{1}{n}X^T X\Big) \le \frac{24\,\sigma_{\max}(\Sigma)}{1 - \gamma}, \qquad \text{and} \qquad \text{(b)} \;\; \sigma_{\min}\Big(\frac{1}{n}X^T X\Big) \ge \frac{\sigma_{\min}(\Sigma)}{4}, \qquad (28)$$

both with probability greater than $1 - 2c_1\exp(-c_2 p)$.

Thus, from the bound (28)(b), we see that with high probability, the RSC property holds with
$\kappa(\mathfrak{X}) = \sigma_{\min}(\Sigma)/4$ as long as $n > c_3 p$.

As before, in order to verify the choice of $\lambda_N$, we need to control the quantity $\frac{1}{n}|||X^T W|||_{op}$.
The following inequality, proved in Appendix C.2, yields a suitable upper bound:

Lemma 5. There exist constants $c_i > 0$, independent of $n$, $p$, $\Sigma$ etc., such that

F[-\\X T W\l P > Ifflk [I] < C2 exp(- C3 p). (29) 
L n 1 — 7 V n 

From Lemma 5, we see that it suffices to choose $\lambda_N = \frac{80\,|||\Sigma|||_{op}}{p\,(1-\gamma)}\sqrt{\frac{p}{n}}$. With this choice,
Corollary 2 of Theorem 1 yields that

$$|||\hat{\Theta} - \Theta^*|||_F^2 \;\le\; c_1 R_q \left[\frac{\sigma_{\max}(\Sigma)}{\sigma_{\min}(\Sigma)\,(1 - \gamma)}\right]^{2-q} \left(\frac{p}{n}\right)^{1-q/2}$$

with probability greater than $1 - c_2\exp(-c_3 p)$, as claimed.



4.5 Proof of Corollary 5

Recall that for this model, the observations are of the form $y_i = \langle\!\langle X_i, \Theta^* \rangle\!\rangle + \varepsilon_i$, where
$\Theta^* \in \mathbb{R}^{k \times p}$ is the unknown matrix, and $\{\varepsilon_i\}_{i=1}^N$ is an associated noise sequence.

Let us now show how Proposition 1 implies the RSC property with an appropriate tolerance
parameter. In particular, let us define $\delta^2 := R_q\big[\sqrt{k/N} + \sqrt{p/N}\big]^{2-q}$, so that if we have the
inequality $|||\Delta|||_F \le \delta$, the result of Corollary 5 follows immediately. Therefore, we may take
$|||\Delta|||_F > \delta$. Now recall from Lemma 1 that the error $\Delta$ satisfies the bound (25). Combining
these facts, we are guaranteed that $\Delta \in \mathcal{C}(r; \delta)$, where the set $\mathcal{C}$ was previously defined (12),
and it is sufficient to establish the RSC property over this set.
Observe that the bound (21) implies that for any $\Delta \in \mathcal{C}$,

$$\frac{\|\mathfrak{X}(\Delta)\|_2}{\sqrt{N}} \;\ge\; \frac{1}{4}|||\Delta|||_F - \left(\sqrt{\frac{k}{N}} + \sqrt{\frac{p}{N}}\right)|||\Delta|||_1. \qquad (30)$$

Following the arguments used in the proofs of Theorem 1 and Corollary 2, we find that

$$|||\Delta|||_1 \;\le\; 4|||\Delta'|||_1 + 4|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 \;\le\; 4\sqrt{2R_q}\,\tau^{-q/2}\,|||\Delta'|||_F + 4R_q\tau^{1-q}, \qquad (31)$$

where $\tau > 0$ is a parameter to be chosen. We now set $\tau = (\sqrt{k} + \sqrt{p})/\sqrt{N}$, and substitute
the resulting bound (31) into equation (30), thereby obtaining

$$\frac{\|\mathfrak{X}(\Delta)\|_2}{\sqrt{N}} \;\ge\; \frac{1}{4}|||\Delta|||_F - 4\sqrt{2R_q}\,\tau^{1-q/2}\,|||\Delta|||_F - 4R_q\tau^{2-q} \;\ge\; \frac{1}{4}|||\Delta|||_F - \sqrt{32}\,\delta\,|||\Delta|||_F - 4\delta\,|||\Delta|||_F.$$

If we choose $N > 200\, R_q^{1/(1-q/2)}\, kp$, then we are guaranteed that $\frac{1}{4} - (4 + \sqrt{32})\delta \ge \frac{1}{8}$, which
shows that the RSC property holds with $\kappa(\mathfrak{X}) = 1/8$.

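The next step is to control the quantity $|||\mathfrak{X}^*(\varepsilon)|||_{op}/N$, required for specifying a suitable
choice of $\lambda_N$.

Lemma 6. If $\|\varepsilon\|_2 \le 2\nu\sqrt{N}$, then

$$P\left[\frac{|||\mathfrak{X}^*(\varepsilon)|||_{op}}{N} \ge 4\nu\left(\sqrt{\frac{k}{N}} + \sqrt{\frac{p}{N}}\right)\right] \;\le\; c_1\exp(-c_2(k + p)). \qquad (32)$$

Proof. By definition of the adjoint operator, we have $\frac{1}{N}\mathfrak{X}^*(\varepsilon) = \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i X_i$. Since the
observation matrices $\{X_i\}_{i=1}^N$ are i.i.d. Gaussian, if the sequence $\{\varepsilon_i\}_{i=1}^N$ is viewed as fixed
(by conditioning as needed), then the random matrix $Z := \frac{1}{N}\sum_{i=1}^{N} \varepsilon_i X_i$ has zero-mean
i.i.d. Gaussian entries with variance $\|\varepsilon\|_2^2/N^2$. Since $Z \in \mathbb{R}^{k \times p}$, known results in random matrix
theory [16] imply that

$$P\left[|||Z|||_{op} \ge \frac{2\|\varepsilon\|_2}{N}\big(\sqrt{k} + \sqrt{p}\big)\right] \;\le\; 2\exp(-c_2(k + p)),$$

which, combined with the assumption $\|\varepsilon\|_2 \le 2\nu\sqrt{N}$, yields the bound (32) as claimed. □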



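4.6 Proof of Corollary 6

This corollary follows from a combination of Proposition 1 and Lemma 1. Let $\hat{\Theta}$ be an optimal
solution to the SDP (22), and let $\Delta = \hat{\Theta} - \Theta^*$ be the error. Since $\hat{\Theta}$ is optimal and $\Theta^*$ is feasible
for the SDP, we have $|||\hat{\Theta}|||_1 = |||\Theta^* + \Delta|||_1 \le |||\Theta^*|||_1$. Using the decomposition $\Delta = \Delta' + \Delta''$ from
Lemma 1 and applying the triangle inequality, we have $|||\Theta^* + \Delta' + \Delta''|||_1 \ge |||\Theta^* + \Delta''|||_1 - |||\Delta'|||_1$.
From the properties of the decomposition in Lemma 1 (see Appendix A), we find that

$$|||\hat{\Theta}|||_1 = |||\Theta^* + \Delta' + \Delta''|||_1 \;\ge\; |||\Theta^*|||_1 + |||\Delta''|||_1 - |||\Delta'|||_1.$$

Combining the pieces yields that $|||\Delta''|||_1 \le |||\Delta'|||_1$, and hence $|||\Delta|||_1 \le 2|||\Delta'|||_1$. By Lemma 1(a),
the rank of $\Delta'$ is at most $2r$, so that we obtain $|||\Delta|||_1 \le 2\sqrt{2r}\,|||\Delta|||_F \le 4\sqrt{r}\,|||\Delta|||_F$.

Note that $\mathfrak{X}(\Delta) = 0$, since both $\hat{\Theta}$ and $\Theta^*$ agree with the observations. Consequently,
from Proposition 1, we have that

$$0 = \frac{\|\mathfrak{X}(\Delta)\|_2}{\sqrt{N}} \;\ge\; |||\Delta|||_F \left(\frac{1}{4} - 4\sqrt{r}\Big(\sqrt{\frac{k}{N}} + \sqrt{\frac{p}{N}}\Big)\right) \;\ge\; |||\Delta|||_F/20,$$

where the final inequality follows from the assumption that $N > 40^2\, r(k + p)$. We have thus
shown that $\Delta = 0$, which implies that $\hat{\Theta} = \Theta^*$ as claimed.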
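The arithmetic behind the final inequality can be spelled out as follows. This is a sketch under two assumptions: that the bound of Proposition 1 has the form $\|\mathfrak{X}(\Delta)\|_2/\sqrt{N} \ge \frac{1}{4}|||\Delta|||_F - (\sqrt{k/N} + \sqrt{p/N})|||\Delta|||_1$, and that the elementary inequality $\sqrt{a} + \sqrt{b} \le \sqrt{2(a+b)}$ is used to combine the two square roots.

```latex
\begin{align*}
\Big(\sqrt{\tfrac{k}{N}} + \sqrt{\tfrac{p}{N}}\Big)\,|||\Delta|||_1
  &\le \sqrt{\tfrac{2(k+p)}{N}}\;\cdot 4\sqrt{r}\,|||\Delta|||_F
   = 4\sqrt{2}\,\sqrt{\tfrac{r(k+p)}{N}}\;|||\Delta|||_F \\
  &< \frac{4\sqrt{2}}{40}\,|||\Delta|||_F \approx 0.141\,|||\Delta|||_F
   \qquad \text{whenever } N > 40^2\, r(k+p), \\
\frac{1}{4} - \frac{4\sqrt{2}}{40} &\approx 0.109 \;\ge\; \frac{1}{20}.
\end{align*}
```

Under these assumptions, $0 \ge |||\Delta|||_F/20$, which is possible only if $\Delta = 0$.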
5 Experimental results

In this section, we report the results of various simulations that demonstrate the close agreement
between the scaling predicted by our theory, and the actual behavior of the SDP-based
M-estimator (9) in practice. In all cases, we solved the convex program (9) by using our
own implementation in MATLAB of an accelerated gradient descent method which adapts
a non-smooth convex optimization procedure [36] to the nuclear-norm [26]. We chose the
regularization parameter $\lambda_N$ in the manner suggested by our theoretical results; in doing so,
we assumed knowledge of quantities such as the noise variance $\nu^2$. (In practice, one would
have to estimate such quantities from the data using standard methods.)
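For readers who want to experiment, the following is a minimal sketch of a nuclear-norm regularized least-squares solver. It is not the implementation used for the reported simulations: it uses plain (non-accelerated) proximal gradient steps with singular value soft-thresholding, in Python rather than MATLAB. The function names, the convention $Y \approx X\Theta$ with $\Theta \in \mathbb{R}^{p \times k}$, and all numerical values are assumptions made for illustration.

```python
import numpy as np

def svd_soft_threshold(A, tau):
    """Proximal operator of tau * nuclear norm: shrink each singular value by tau."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

def nuclear_norm_regression(X, Y, lam, n_iter=500):
    """Minimize (1/(2n)) ||Y - X Theta||_F^2 + lam * ||Theta||_nuc by proximal gradient."""
    n, p = X.shape
    Theta = np.zeros((p, Y.shape[1]))
    step = n / np.linalg.norm(X, 2) ** 2       # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        grad = X.T @ (X @ Theta - Y) / n
        Theta = svd_soft_threshold(Theta - step * grad, step * lam)
    return Theta

# Small synthetic instance (sizes and noise level are arbitrary).
rng = np.random.default_rng(0)
n, p, k, r = 200, 15, 12, 2
Theta_star = rng.standard_normal((p, r)) @ rng.standard_normal((r, k))
X = rng.standard_normal((n, p))
Y = X @ Theta_star + 0.1 * rng.standard_normal((n, k))
Theta_hat = nuclear_norm_regression(X, Y, lam=0.05)
rel_err = np.linalg.norm(Theta_hat - Theta_star) / np.linalg.norm(Theta_star)
print(rel_err)
```

An accelerated variant would add a momentum step between iterations; the proximal step itself stays the same.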

We report simulation results for three of the running examples discussed in this paper:
low-rank multivariate regression, estimation in vector autoregressive processes, and matrix
recovery from random projections (compressed sensing). In each case, we solved instances of
the SDP for a square matrix $\Theta^* \in \mathbb{R}^{p \times p}$, where $p \in \{40, 80, 160\}$ for the first two examples,
and $p \in \{20, 40, 80\}$ for the compressed sensing example. In all cases, we considered the case
of exact low rank constraints, with $\mathrm{rank}(\Theta^*) = r = 10$, and we generated $\Theta^*$ by choosing
the subspaces of its left and right singular vectors uniformly at random from the Grassmann
manifold. The observation or innovations noise had variance $\nu^2 = 1$ in each case. The VAR
process was generated by first solving for the covariance matrix $\Sigma$ using the MATLAB function
dlyap and then generating a sample path. For each setting of $(r, p)$, we solved the SDP for a
range of sample sizes $N$.
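The data-generation steps described above can be sketched as follows. This is an illustrative reconstruction, not the original MATLAB code: the stationary covariance is obtained by summing the series $\Sigma = \sum_{j \ge 0} \nu^2\, \Theta^j (\Theta^j)^T$ instead of calling a Lyapunov solver, and the choice of equal singular values (`top_singular_value`), the function names, and the sample size are assumptions.

```python
import numpy as np

def random_low_rank(p, r, top_singular_value, rng):
    """Rank-r p x p matrix whose singular subspaces are drawn uniformly at random."""
    U, _ = np.linalg.qr(rng.standard_normal((p, r)))   # orthonormal basis, left subspace
    V, _ = np.linalg.qr(rng.standard_normal((p, r)))   # orthonormal basis, right subspace
    return top_singular_value * U @ V.T                # all r singular values are equal

def stationary_covariance(Theta, nu2, n_terms=200):
    """Solve Sigma = Theta Sigma Theta^T + nu2 * I by summing the convergent series."""
    p = Theta.shape[0]
    Sigma, term = np.zeros((p, p)), nu2 * np.eye(p)
    for _ in range(n_terms):
        Sigma += term
        term = Theta @ term @ Theta.T
    return Sigma

def var_sample_path(Theta, nu2, n, rng):
    """Draw the first sample from the stationary law, then iterate Z_next = Theta Z + W."""
    p = Theta.shape[0]
    Sigma = stationary_covariance(Theta, nu2)
    Z = np.empty((n + 1, p))
    Z[0] = np.linalg.cholesky(Sigma) @ rng.standard_normal(p)
    for t in range(n):
        Z[t + 1] = Theta @ Z[t] + np.sqrt(nu2) * rng.standard_normal(p)
    return Z[:-1], Z[1:], Sigma                        # X (current), Y (next), Sigma

rng = np.random.default_rng(0)
Theta_star = random_low_rank(p=40, r=10, top_singular_value=0.5, rng=rng)
X, Y, Sigma = var_sample_path(Theta_star, nu2=1.0, n=300, rng=rng)
```

Keeping the largest singular value below one makes the process stable, so the series for $\Sigma$ converges geometrically.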



[Figure 1 plots. Panel (a): "Error versus sample size", horizontal axis "Sample size". Panel (b): "Error versus rescaled sample size", horizontal axis "Rescaled sample size". Curves shown for p = 40, 80, 160.]

Figure 1. Results of applying the SDP (9) with nuclear norm regularization to the problem of
low-rank multivariate regression. (a) Plots of the Frobenius error $|||\hat{\Theta} - \Theta^*|||_F$ on a logarithmic
scale versus the sample size $N$ for three different matrix sizes $p \in \{40, 80, 160\}$, all with rank
$r = 10$. (b) Plots of the same Frobenius error versus the rescaled sample size $N/(rp)$. Consistent
with theory, all three plots are now extremely well-aligned.

Figure 1 shows results for a multivariate regression model with the covariates chosen
randomly from a $N(0, I)$ distribution. Panel (a) plots the Frobenius error $|||\hat{\Theta} - \Theta^*|||_F$ on a
logarithmic scale versus the sample size $N$ for three different matrix sizes, $p \in \{40, 80, 160\}$.
Naturally, in each case, the error decays to zero as $N$ increases, but larger matrices require
larger sample sizes, as reflected by the rightward shift of the curves as $p$ is increased. Panel
(b) of Figure 1 shows the exact same set of simulation results, but now with the Frobenius
error plotted versus the rescaled sample size $\tilde{N} := N/(rp)$. As predicted by Corollary 3, the
error plots now are all aligned with one another; the degree of alignment in this particular
case is so close that the three plots are now indistinguishable. (The blue curve is the only one
visible since it was plotted last by our routine.) Consequently, Figure 1 shows that $N/(rp)$
acts as the effective sample size in this high-dimensional setting.
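To make the rescaling concrete, the short sketch below converts between raw and rescaled sample sizes for the matrix sizes used in the first two experiments. The helper names and the target rescaled value of 4 are illustrative assumptions.

```python
def rescaled_sample_size(N, r, p):
    """Rescaled sample size N / (r p) used on the horizontal axis of panel (b)."""
    return N / (r * p)

def sample_size_for(rescaled, r, p):
    """Raw sample size N that corresponds to a given rescaled value."""
    return int(round(rescaled * r * p))

r = 10
for p in (40, 80, 160):
    N = sample_size_for(4.0, r, p)
    print(p, N, rescaled_sample_size(N, r, p))
# prints:
# 40 1600 4.0
# 80 3200 4.0
# 160 6400 4.0
```

Doubling $p$ at fixed rank doubles the raw sample size needed to reach the same point on the rescaled axis, which matches the rightward shift of the curves in panel (a).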

Figure 2 shows similar results for the autoregressive model discussed in Example 2. As
shown in panel (a), the Frobenius error again decays as the sample size is increased, although
problems involving larger matrices are shifted to the right. Panel (b) shows the same Frobenius
error plotted versus the rescaled sample size $N/(rp)$; as predicted by Corollary 4, the
errors for different matrix sizes $p$ are again quite well-aligned. In this case, we find (both in
our theoretical analysis and experimental results) that the dependence in the autoregressive
process slows down the rate at which the concentration occurs, so that the results are not as
crisp as the low-rank multivariate setting in Figure 1.




[Figure 2 plots. Panel (a): horizontal axis "Sample size". Panel (b): horizontal axis "Rescaled sample size".]

Figure 2. Results of applying the SDP (9) with nuclear norm regularization to estimating the
system matrix of a vector autoregressive process. (a) Plots of the Frobenius error $|||\hat{\Theta} - \Theta^*|||_F$ on
a logarithmic scale versus the sample size $N$ for three different matrix sizes $p \in \{40, 80, 160\}$,
all with rank $r = 10$. (b) Plots of the same Frobenius error versus the rescaled sample size
$N/(rp)$. Consistent with theory, all three plots are now reasonably well-aligned.

Finally, Figure 3 presents the same set of results for the compressed sensing observation
model discussed in Example 3. Even though the observation matrices $X_i$ here are qualitatively
different (in comparison to the multivariate regression and autoregressive examples), we again
see the "stacking" phenomenon of the curves when plotted versus the rescaled sample size
$N/(rp)$, as predicted by Corollary 5.

Figure 3. Results of applying the SDP (9) with nuclear norm regularization to recovering
a low-rank matrix on the basis of random projections (compressed sensing model). (a) Plots
of the Frobenius error $|||\hat{\Theta} - \Theta^*|||_F$ on a logarithmic scale versus the sample size $N$ for three
different matrix sizes $p \in \{20, 40, 80\}$, all with rank $r = 10$. (b) Plots of the same Frobenius
error versus the rescaled sample size $N/(rp)$. Consistent with theory, all three plots are now
reasonably well-aligned.

6 Discussion

In this paper, we have analyzed the nuclear norm relaxation for a general class of noisy
observation models, and obtained non-asymptotic error bounds on the Frobenius norm that
hold under high-dimensional scaling. In contrast to most past work, our results are applicable
to both exactly and approximately low-rank matrices. We stated a main theorem that provides
high-dimensional rates in a fairly general setting, and then showed how, by specializing this
result to some specific model classes (namely, low-rank multivariate regression, estimation
of autoregressive processes, and matrix recovery from random projections), it yields concrete
and readily interpretable rates. Lastly, we provided some simulation results that showed
excellent agreement with the predictions from our theory.

This paper has focused on achievable results for low-rank matrix estimation using a particular
polynomial-time method. It would be interesting to establish matching lower bounds,
showing that the rates obtained by this estimator are minimax-optimal. We suspect that this
should be possible, for instance by using the techniques exploited in Raskutti et al. [39] in
analyzing minimax rates for regression over $\ell_q$-balls.
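A Proof of Lemma 1

Part (a) of the claim was proved in Recht et al. [41]; we simply provide a proof here for
completeness. We write the SVD as $\Theta^* = U D V^T$, where $U \in \mathbb{R}^{k \times k}$ and $V \in \mathbb{R}^{p \times p}$ are
orthogonal matrices, and $D$ is the matrix formed by the singular values of $\Theta^*$. By re-ordering
as needed, we may assume without loss of generality that the first $r$ columns of $U$ (respectively
$V$) span the subspace of left (respectively right) singular vectors from the statement. We then define the
matrix $\Gamma = U^T \Delta V \in \mathbb{R}^{k \times p}$, and write it in block form as

$$\Gamma = \begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & \Gamma_{22} \end{bmatrix}, \qquad \text{where } \Gamma_{11} \in \mathbb{R}^{r \times r}, \text{ and } \Gamma_{22} \in \mathbb{R}^{(k-r) \times (p-r)}.$$

We now define the matrices

$$\Delta'' = U \begin{bmatrix} 0 & 0 \\ 0 & \Gamma_{22} \end{bmatrix} V^T, \qquad \text{and} \qquad \Delta' = \Delta - \Delta''.$$

Note that we have

$$\mathrm{rank}(\Delta') = \mathrm{rank}\begin{bmatrix} \Gamma_{11} & \Gamma_{12} \\ \Gamma_{21} & 0 \end{bmatrix} \;\le\; \mathrm{rank}\begin{bmatrix} 0 & \Gamma_{12} \\ 0 & 0 \end{bmatrix} + \mathrm{rank}\begin{bmatrix} \Gamma_{11} & 0 \\ \Gamma_{21} & 0 \end{bmatrix} \;\le\; 2r,$$

which establishes Lemma 1(a). Moreover, we note for future reference that by construction
of $\Delta''$, the nuclear norm satisfies the decomposition

$$|||\Pi_{\mathcal{M}}(\Theta^*) + \Delta''|||_1 = |||\Pi_{\mathcal{M}}(\Theta^*)|||_1 + |||\Delta''|||_1. \qquad (33)$$

We now turn to the proof of Lemma 1(b). Recall that the error $\Delta = \hat{\Theta} - \Theta^*$ associated
with any optimal solution must satisfy the inequality (23), which implies that

$$0 \;\le\; \frac{1}{N}\langle \varepsilon, \mathfrak{X}(\Delta) \rangle + \lambda_N\{|||\Theta^*|||_1 - |||\hat{\Theta}|||_1\} \;\le\; \frac{1}{N}|||\mathfrak{X}^*(\varepsilon)|||_{op}\,|||\Delta|||_1 + \lambda_N\{|||\Theta^*|||_1 - |||\hat{\Theta}|||_1\}, \qquad (34)$$

where we have used the bound (24).

Using the triangle inequality and the relation (33), we have

$$\begin{aligned}
|||\hat{\Theta}|||_1 &= |||(\Pi_{\mathcal{M}}(\Theta^*) + \Delta'') + (\Pi_{\mathcal{M}^\perp}(\Theta^*) + \Delta')|||_1 \\
&\ge |||\Pi_{\mathcal{M}}(\Theta^*) + \Delta''|||_1 - |||\Pi_{\mathcal{M}^\perp}(\Theta^*) + \Delta'|||_1 \\
&\ge |||\Pi_{\mathcal{M}}(\Theta^*)|||_1 + |||\Delta''|||_1 - \{|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + |||\Delta'|||_1\}.
\end{aligned}$$

Consequently, we have

$$\begin{aligned}
|||\Theta^*|||_1 - |||\hat{\Theta}|||_1 &\le |||\Theta^*|||_1 - \{|||\Pi_{\mathcal{M}}(\Theta^*)|||_1 + |||\Delta''|||_1\} + \{|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + |||\Delta'|||_1\} \\
&= 2|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + |||\Delta'|||_1 - |||\Delta''|||_1.
\end{aligned}$$

Substituting this inequality into the bound (34), we obtain

$$0 \;\le\; \frac{1}{N}|||\mathfrak{X}^*(\varepsilon)|||_{op}\,|||\Delta|||_1 + \lambda_N\{2|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + |||\Delta'|||_1 - |||\Delta''|||_1\}.$$

Finally, since $\frac{1}{N}|||\mathfrak{X}^*(\varepsilon)|||_{op} \le \lambda_N/2$ by assumption and $|||\Delta|||_1 \le |||\Delta'|||_1 + |||\Delta''|||_1$, we conclude that

$$0 \;\le\; \lambda_N\big\{2|||\Pi_{\mathcal{M}^\perp}(\Theta^*)|||_1 + \tfrac{3}{2}|||\Delta'|||_1 - \tfrac{1}{2}|||\Delta''|||_1\big\},$$

from which the bound (25) follows.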
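The construction of $\Delta'$ and $\Delta''$ can be reproduced numerically. The sketch below is illustrative and rests on assumptions: it takes $\Theta^*$ to be exactly rank $r$ (so that $\Pi_{\mathcal{M}}(\Theta^*) = \Theta^*$), uses a random error matrix, and checks the rank bound of Lemma 1(a) together with the additivity relation (33). Variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
k, p, r = 7, 9, 2                            # illustrative sizes

# Rank-r target and its full SVD; the first r singular vectors span (U, V).
Theta_star = rng.standard_normal((k, r)) @ rng.standard_normal((r, p))
U, _, Vt = np.linalg.svd(Theta_star, full_matrices=True)

Delta = rng.standard_normal((k, p))          # arbitrary error matrix
Gamma = U.T @ Delta @ Vt.T                   # coordinates of Delta in the SVD bases

Gamma22 = np.zeros_like(Gamma)
Gamma22[r:, r:] = Gamma[r:, r:]              # keep only the lower-right block
Delta_pp = U @ Gamma22 @ Vt                  # Delta''
Delta_p = Delta - Delta_pp                   # Delta'

def nuc(A):
    """Nuclear norm (sum of singular values)."""
    return np.linalg.norm(A, "nuc")

print(np.linalg.matrix_rank(Delta_p) <= 2 * r)                                   # Lemma 1(a)
print(np.isclose(nuc(Theta_star + Delta_pp), nuc(Theta_star) + nuc(Delta_pp)))   # relation (33)
```

The additivity holds because $\Delta''$ has row and column spaces orthogonal to those of $\Theta^*$, so the singular values of the sum are the union of the two sets of singular values.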



B Proof of Lemma 3

Let $S^{p-1} = \{u \in \mathbb{R}^p \mid \|u\|_2 = 1\}$ denote the Euclidean sphere in $p$-dimensions. The operator
norm of interest has the variational representation

$$\frac{1}{n}|||X^T W|||_{op} = \frac{1}{n}\sup_{u \in S^{k-1}}\,\sup_{v \in S^{p-1}} v^T X^T W u.$$

For positive scalars $a$ and $b$, define the (random) quantity

$$\Psi(a, b) := \sup_{u \in a S^{k-1}}\,\sup_{v \in b S^{p-1}} \langle Xv, Wu \rangle,$$

and note that our goal is to upper bound $\Psi(1, 1)$. Note moreover that $\Psi(a, b) = ab\,\Psi(1, 1)$, a
relation which will be useful in the analysis.

Let $A = \{u^1, \ldots, u^A\}$ and $B = \{v^1, \ldots, v^B\}$ denote 1/4 coverings of $S^{k-1}$ and $S^{p-1}$,
respectively. We now claim that we have the upper bound

$$\Psi(1, 1) \;\le\; 4\max_{u^a \in A,\, v^b \in B} \langle Xv^b, Wu^a \rangle. \qquad (35)$$

To establish this claim, we note that since the sets $A$ and $B$ are 1/4-covers, for any pair
$(u, v) \in S^{k-1} \times S^{p-1}$, there exists a pair $(u^a, v^b) \in A \times B$ such that $u = u^a + \Delta u$ and
$v = v^b + \Delta v$, with $\max\{\|\Delta u\|_2, \|\Delta v\|_2\} \le 1/4$. Consequently, we can write

$$\langle Xv, Wu \rangle = \langle Xv^b, Wu^a \rangle + \langle Xv^b, W\Delta u \rangle + \langle X\Delta v, Wu^a \rangle + \langle X\Delta v, W\Delta u \rangle. \qquad (36)$$

By construction, we have the bound $|\langle Xv^b, W\Delta u \rangle| \le \Psi(1/4, 1) = \frac{1}{4}\Psi(1, 1)$, and similarly
$|\langle X\Delta v, Wu^a \rangle| \le \frac{1}{4}\Psi(1, 1)$ as well as $|\langle X\Delta v, W\Delta u \rangle| \le \frac{1}{16}\Psi(1, 1)$. Substituting these bounds
into the decomposition (36) and taking suprema over the left and right-hand sides, we conclude
that

$$\Psi(1, 1) \;\le\; \max_{u^a \in A,\, v^b \in B} \langle Xv^b, Wu^a \rangle + \frac{9}{16}\Psi(1, 1),$$

from which the bound (35) follows.

We now apply the union bound to control the discrete maximum. It is known (e.g., [29, 33])
that there exists a 1/4 covering of $S^{k-1}$ and $S^{p-1}$ with at most $A \le 8^k$ and $B \le 8^p$ elements
respectively. Consequently, we have

$$P\big[|\Psi(1, 1)| \ge 4\delta n\big] \;\le\; 8^{k+p}\max_{u^a, v^b} P\left[\frac{|\langle Xv^b, Wu^a \rangle|}{n} \ge \delta\right]. \qquad (37)$$

It remains to obtain a good bound on the quantity $\frac{1}{n}\langle Xv, Wu \rangle = \frac{1}{n}\sum_{i=1}^{n} \langle v, X_i \rangle \langle u, W_i \rangle$,
where $(u, v) \in S^{k-1} \times S^{p-1}$ are arbitrary but fixed. Since $W_i \in \mathbb{R}^k$ has i.i.d. $N(0, \nu^2)$ elements
and $u$ is fixed, we have $Z_i := \langle u, W_i \rangle \sim N(0, \nu^2)$ for each $i = 1, \ldots, n$. These variables are
independent of one another, and of the random matrix $X$. Therefore, conditioned on $X$, the
sum $Z := \frac{1}{n}\sum_{i=1}^{n} \langle v, X_i \rangle \langle u, W_i \rangle$ is zero-mean Gaussian with variance

$$\alpha^2 := \frac{\nu^2}{n}\Big(\frac{1}{n}\|Xv\|_2^2\Big) \;\le\; \frac{\nu^2}{n}\,|||X^T X/n|||_{op}.$$

Define the event $\mathcal{T} = \{\alpha^2 \le 9\nu^2 |||\Sigma|||_{op}/n\}$. Using Lemma 2, we have $|||X^T X/n|||_{op} \le 9\sigma_{\max}(\Sigma)$
with probability at least $1 - 2\exp(-n/2)$, which implies that $P[\mathcal{T}^c] \le 2\exp(-n/2)$. Therefore,
conditioning on the event $\mathcal{T}$ and its complement $\mathcal{T}^c$, we obtain

F[\Z\ >t]< F[\Z\ >t\T]+ P[T C1 

t 2 



~ 6XP f " n 2^(4 + |||S||| op ) , ) + 2ex P(-/ 2 )- 



Combining this tail bound with the upper bound (|37l) . we have 

P[|^(l, 1)| >A5n]< 8 k+p jexp + 2exp(-n/2)| . 



Setting $t^2 = 20\nu^2 |||\Sigma|||_{op}\frac{k+p}{n}$, this probability vanishes as long as $n > 16(k + p)$.

C Technical details for Corollary 4

In this appendix, we collect the proofs of Lemmas 4 and 5.

C.1 Proof of Lemma 4

Recalling that $S^{p-1}$ denotes the unit-norm Euclidean sphere in $p$-dimensions, we first observe
that $|||X|||_{op} = \sup_{u \in S^{p-1}} \|Xu\|_2$. Our next step is to reduce the supremum to a maximization
over a finite set, using a standard covering argument. Let $A = \{u^1, \ldots, u^A\}$ denote a
1/2-cover of it. By definition, for any $u \in S^{p-1}$, there is some $u^a \in A$ such that $u = u^a + \Delta u$,
where $\|\Delta u\|_2 \le 1/2$. Consequently, for any $u \in S^{p-1}$, the triangle inequality implies that

$$\|Xu\|_2 \;\le\; \|Xu^a\|_2 + \|X\Delta u\|_2,$$

and hence that $|||X|||_{op} \le \max_{u^a \in A} \|Xu^a\|_2 + \frac{1}{2}|||X|||_{op}$. Re-arranging yields the useful inequality

$$|||X|||_{op} \;\le\; 2\max_{u^a \in A} \|Xu^a\|_2. \qquad (38)$$



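Using inequality (38), we have

$$P\left[\frac{1}{n}|||X^T X|||_{op} \ge t\right] \;\le\; P\left[\max_{u^a \in A} \frac{1}{n}\sum_{i=1}^{n} (\langle u^a, X_i \rangle)^2 \ge \frac{t}{4}\right] \;\le\; 4^p \max_{u^a \in A} P\left[\frac{1}{n}\sum_{i=1}^{n} (\langle u^a, X_i \rangle)^2 \ge \frac{t}{4}\right], \qquad (39)$$

where the last inequality follows from the union bound, and the fact [29, 33] that there exists
a 1/2-covering of $S^{p-1}$ with at most $4^p$ elements.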

In order to complete the proof, we need to obtain a sharp upper bound on the quantity 

P[£ X i)) 2 > |]> valid for an y fixed u G 5 ' P " 1 - Define the random vector Y £ W 1 

with elements = (u, X^. Note that V is zero mean, and its covariance matrix R has 
elements Rij = E[Y^Yj] = -u T X(0*)l jf ~ J l u. In order to bound the spectral norm of R, we note 
that since it is symmetric, we have |-R||| p < max J^?=i l-^yl> and moreover 



\Ri 



i=l,...,p 

^-^ui < rue 



|u S(6* 



op J 



S < 7 



II op- 



Combining the pieces, we obtain 



liJllop^maxVlTll^llSllop < 2|||£||| op V | 7 | J < -f 

i 1 — » * — * 1 

j=l j=0 



2IIIEI 



op 



(40) 



Moreover, we have trace(i?)/n = u T T,u < |||£|| p 
conclude that 



1 



\Y\\ 2 



> lollop + 5 



\R\ 



n \ n 

Combined with the bound (1391). we obtain 



op 



Applying Lemma [8] with t 
< 2 exp ( - bp) + 2 exp -n/2). 



1 T 

-X -XI Lp < S Lp 

n 



2 + 



20 



(1-7) 



- !> < 

71 



24 1 S| 



op 



(1-7) ' 



we 



(41) 
with probability at least $1 - c_1\exp(-c_2 p)$, which establishes the upper bound (28)(a).

Turning to the lower bound (28)(b), we let $B = \{v^1, \ldots, v^B\}$ be an $\epsilon$-cover of $S^{p-1}$ for
some $\epsilon \in (0, 1)$ to be chosen. Thus, for any $v \in S^{p-1}$, there exists some $v^b$ such that $v = v^b + \Delta v$,
and $\|\Delta v\|_2 \le \epsilon$. Define the function $\Psi : \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}$ via $\Psi(u, v) = u^T\big(\frac{1}{n}X^T X\big)v$, and note
that $\Psi(u, v) = \Psi(v, u)$. With this notation, we have

$$v^T\Big(\frac{1}{n}X^T X\Big)v = \Psi(v, v) = \Psi(v^b, v^b) + 2\Psi(\Delta v, v^b) + \Psi(\Delta v, \Delta v) \;\ge\; \Psi(v^b, v^b) + 2\Psi(\Delta v, v^b),$$

since $\Psi(\Delta v, \Delta v) \ge 0$. Since $|\Psi(\Delta v, v^b)| \le \epsilon\,|||\frac{1}{n}X^T X|||_{op}$, we obtain the lower bound

$$\sigma_{\min}\Big(\frac{1}{n}X^T X\Big) = \inf_{v \in S^{p-1}} v^T\Big(\frac{1}{n}X^T X\Big)v \;\ge\; \min_{v^b \in B} \Psi(v^b, v^b) - 2\epsilon\,|||\tfrac{1}{n}X^T X|||_{op}.$$

By the previously established upper bound® (a), have |±X T Al op < ^f^E wit h high 

probability. Hence, choosing e = ^oolfsl"^ ensures that 2e|||^X T X||| p < <r m i n (S)/4. 

Consequently, it suffices to lower bound the minimum over the covering set. We first 
establish a concentration result for the function ty(v,v) that holds for any fixed v G S p ~ . 
Note that we can write 



n 

*( v , v ) = -Y / ((v,X l )f 



n 

i=l 



As before, if we define the random vector Y G W 1 with elements Yi = {v, Xi), then Y ~ 
N(0,R) with |||i?||| p < ^glsi. Moreover, we have trace(i?)/n = v T Ilv > a min (S). Conse- 
quently, applying Lemma [8] yields 



II Vl|2 , „ 8t 11^ III OP 
_ \\ Y 2 < ^minO) : 

n 1 — 7 



< 2exp ( - n(t - 2/ v / n) 2 / 2 ) + 2exp(- 



Note that this bound holds for any fixed v G S p 1 . Setting t* = ^ iaw^ and applying 
the union bound yields that 

F[mm*(v b ,v b ) < <r min (£)/2] < (-) p {2exp ( - n(t* - 2/^) 2 /2) + 2exp(-~)V 

which vanishes as long as n > il °f«y{ e ^ p- 
C.2 Proof of Lemma 5

Let $S^{p-1} = \{u \in \mathbb{R}^p \mid \|u\|_2 = 1\}$ denote the Euclidean sphere in $p$-dimensions, and for positive
scalars $a$ and $b$, define the random variable $\Psi(a, b) := \sup_{u \in a S^{p-1}} \sup_{v \in b S^{p-1}} \langle Xv, Wu \rangle$. Note
that our goal is to upper bound $\Psi(1, 1)$. Let $A = \{u^1, \ldots, u^A\}$ and $B = \{v^1, \ldots, v^B\}$ denote
1/4 coverings of $S^{p-1}$ and $S^{p-1}$, respectively. Following the same argument as in the proof of
Lemma 3, we obtain the upper bound

$$\Psi(1, 1) \;\le\; 4\max_{u^a \in A,\, v^b \in B} \langle Xv^b, Wu^a \rangle. \qquad (42)$$

We now apply the union bound to control the discrete maximum. It is known (e.g., [29, 33])
that there exists a 1/4 covering of $S^{p-1}$ with at most $8^p$ elements. Consequently, we have

$$P\big[|\Psi(1, 1)| \ge 4\delta n\big] \;\le\; 8^{2p}\max_{u^a, v^b} P\left[\frac{|\langle Xv^b, Wu^a \rangle|}{n} \ge \delta\right]. \qquad (43)$$

It remains to obtain a tail bound on the quantity $P\big[\frac{|\langle Xv, Wu \rangle|}{n} \ge \delta\big]$, for any fixed pair
$(u, v) \in A \times B$.

For each i = 1, . . . ,re, let X{ and W% denote the i th row of X and W. Following some 
simple algebra, we have the decomposition ( Xv >Wv) — ^ _ 7" 2 _ J" 3 5 where 

i n 1 

i=i 

8=1 

1 11 1 
8=1 

We may now bound each $T_j$, $j = 1, 2, 3$ in turn; in doing so, we make repeated use of
Lemma 8, which provides concentration bounds for a random variable of the form $\|Y\|_2^2$,
where $Y \sim N(0, Q)$ for some matrix $Q \succeq 0$.



Bound on T2: We begin with T2, which the easiest to control since (up to scaling by v), it 
corresponds to the deviation away from the mean of x 2 -variable with n degrees of freedom. 
Consequently, applying Lemma [8] with Q = I, we obtain 

P[|T 2 | > Au 2 t] < 2 exp ( - n (t ~ ) + 2 exp(-n/2) . (44) 

Bound on $T_3$: We can write the term $T_3$ as a deviation of $\|Y\|_2^2/n$ from its mean, where in
this case the covariance matrix $Q$ is no longer the identity. In concrete terms, let us define
a random vector $Y \in \mathbb{R}^n$ with elements $Y_i = \langle v, X_i \rangle$. As seen in the proof of Lemma 4
from Appendix C.1, the vector $Y$ is zero-mean Gaussian with covariance matrix $R$ such that
$|||R|||_{op} \le \frac{2|||\Sigma|||_{op}}{1 - \gamma}$ (see equation (40)). Since we have $\mathrm{trace}(R)/n = v^T \Sigma v$, applying Lemma 8

yields that 

P[|T S | > ^k t ] < 2exp(- n( ^ ~ 2 2/ ^ )2 )+2exp(-n/2). (45) 

Bound on T\\ To control this quantity, let us define a zero-mean Gaussian random vector 
Z G W 1 with elements 2j = (v, Xi) + (tt, Wj). This random vector has covariance matrix S 
with elements 

Sij = E[ZiZj] = v 2 b ij + (1 - 6 ij yv T (e*)\ i - j \- 1 n + ^(9*)^'^, 
where $\delta_{ij}$ is the Kronecker delta for the event $\{i = j\}$. As before, by symmetry of $S$, we have
fop < max i= i ) ... in Y%=i and hence 

j— 1 n 

<„2 + | S || op + ^{^f-^u + v T (e*f-^v\ + W^^f-^u + v T (@*f-^x 



op 



3=1 

oo 



j=i+l 



<u 2 + ||S||op + 2^ z/V" 1 + 2 mhr* 

3=1 3=1 



<u 2 + 1 Slop + 



2^ 



+ 



Hop 



1 — 7 1 — 7 
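The row-sum argument can be sanity-checked numerically. The following is a minimal sketch under stated assumptions: it takes the stationary covariance to solve $\Sigma = \Theta^* \Sigma (\Theta^*)^T + \nu^2 I$ (computed by a truncated series), uses a random $\Theta^*$ rescaled so that its operator norm equals $\gamma$, and builds $S$ from the entrywise formula for $S_{ij}$. The helper names `stationary_covariance` and `build_S` are hypothetical.

```python
import numpy as np

def stationary_covariance(theta, nu2, terms=400):
    """Truncated series Sigma = nu2 * sum_m theta^m (theta^m)^T."""
    p = theta.shape[0]
    sigma = np.zeros((p, p))
    power = np.eye(p)
    for _ in range(terms):
        sigma += nu2 * (power @ power.T)
        power = theta @ power
    return sigma

def build_S(theta, sigma, u, v, nu2, n):
    """Fill S_ij from the entrywise formula; it depends only on |i - j|."""
    entry = np.empty(n)                 # entry[m] = S_ij whenever |i - j| = m
    entry[0] = nu2 + v @ sigma @ v
    power = np.eye(theta.shape[0])      # holds theta^(m-1) inside the loop
    for m in range(1, n):
        entry[m] = nu2 * (v @ power @ u) + v @ (theta @ power) @ sigma @ v
        power = theta @ power
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return entry[idx]

rng = np.random.default_rng(0)
p, n, nu2, gamma = 4, 60, 1.0, 0.5
A = rng.standard_normal((p, p))
theta = gamma * A / np.linalg.norm(A, ord=2)   # operator norm equal to gamma
sigma = stationary_covariance(theta, nu2)
u = rng.standard_normal(p)
u /= np.linalg.norm(u)
v = rng.standard_normal(p)
v /= np.linalg.norm(v)

S = build_S(theta, sigma, u, v, nu2, n)
sigma_op = np.linalg.norm(sigma, ord=2)
S_op = np.linalg.norm(S, ord=2)
bound = nu2 + sigma_op + 2 * nu2 / (1 - gamma) + 2 * sigma_op / (1 - gamma)
print(S_op, bound)
```

In runs of this kind the operator norm of $S$ typically sits well below the bound, since the geometric-series step is a worst case over $u$ and $v$.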

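Moreover, we have $\mathrm{trace}(S)/n = \nu^2 + v^T \Sigma v$, so that by applying Lemma 8 we conclude that

$$\mathbb{P}\Big[|T_1| > \Big(\frac{12\nu^2}{1-\gamma} + \frac{12|||\Sigma|||_{\mathrm{op}}}{1-\gamma}\Big)\, t\Big] \;\leq\; 2\exp\Big(-\frac{n(t - 2/\sqrt{n})^2}{2}\Big) + 2\exp(-n/2), \tag{46}$$

which completes the analysis of this term.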

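Combining the bounds (44), (45) and (46), we conclude that for all $t > 0$,

$$\mathbb{P}\Big[\frac{|\langle Xv, Wu\rangle|}{n} \geq \frac{20\,(|||\Sigma|||_{\mathrm{op}} + \nu^2)\, t}{1-\gamma}\Big] \;\leq\; 6\exp\Big(-\frac{n(t - 2/\sqrt{n})^2}{2}\Big) + 6\exp(-n/2). \tag{47}$$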



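Setting $t = 10\sqrt{p/n}$ and combining with the bound (43), we conclude that

$$\mathbb{P}\Big[\frac{1}{n}|\Psi(1,1)| \geq \frac{400\,(|||\Sigma|||_{\mathrm{op}} + \nu^2)}{1-\gamma}\sqrt{\frac{p}{n}}\Big] \;\leq\; 8^{2p}\big\{6\exp(-16p) + 6\exp(-n/2)\big\} \;\leq\; 12\exp(-p)$$

as long as $n > ((4\log 8) + 1)p$.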



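D Proof of Proposition 1

Note that $\|\mathfrak{X}(\Theta)\|_2 = \sup_{u \in S^{N-1}} \langle u, \mathfrak{X}(\Theta)\rangle$, and that since the claim (21) is invariant to rescaling, it suffices to prove it for all $\Theta \in \mathbb{R}^{k \times p}$ with $|||\Theta|||_F = 1$. Letting $t \geq 1$ be a given radius, we seek lower bounds on the quantity

$$Z^*(t) := \inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} \langle u, \mathfrak{X}(\Theta)\rangle, \quad \text{where } \mathcal{R}(t) = \big\{\Theta \in \mathbb{R}^{k \times p} \;\big|\; |||\Theta|||_F = 1,\; |||\Theta|||_1 \leq t\big\}.$$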

In particular, our goal is to prove that for any $t \geq 1$, the lower bound

$$\frac{Z^*(t)}{\sqrt{N}} \;\geq\; \frac{1}{4} - \Big[\frac{k+p}{N}\Big]^{1/2} t \tag{48}$$

holds with probability at least $1 - c_1\exp(-c_2 N)$. By a standard peeling argument (see Raskutti et al. [39] for details), this lower bound implies the claim (21).

We establish the lower bound (48) using Gaussian comparison inequalities [29] and concentration of measure (see Lemma 7). For each pair $(u, \Theta) \in S^{N-1} \times \mathcal{R}(t)$, consider the random variable $Z_{u,\Theta} = \langle u, \mathfrak{X}(\Theta)\rangle$, and note that it is Gaussian with zero mean. For any two pairs $(u, \Theta)$ and $(u', \Theta')$, some calculation yields

$$\mathbb{E}\big[(Z_{u,\Theta} - Z_{u',\Theta'})^2\big] = |||u \otimes \Theta - u' \otimes \Theta'|||_F^2. \tag{49}$$
We now define a second Gaussian process $\{Y_{u,\Theta} \mid (u, \Theta) \in S^{N-1} \times \mathcal{R}(t)\}$ via

$$Y_{u,\Theta} := \langle g, u\rangle + \langle\langle G, \Theta\rangle\rangle,$$

where $g \in \mathbb{R}^N$ and $G \in \mathbb{R}^{k \times p}$ are independent with i.i.d. $N(0, 1)$ entries. By construction, $Y_{u,\Theta}$ is zero-mean, and moreover, for any two pairs $(u, \Theta)$ and $(u', \Theta')$, we have

$$\mathbb{E}\big[(Y_{u,\Theta} - Y_{u',\Theta'})^2\big] = \|u - u'\|_2^2 + |||\Theta - \Theta'|||_F^2. \tag{50}$$

It can be shown that for all pairs $(u, \Theta), (u', \Theta') \in S^{N-1} \times \mathcal{R}(t)$, we have

$$|||u \otimes \Theta - u' \otimes \Theta'|||_F^2 \;\leq\; \|u - u'\|_2^2 + |||\Theta - \Theta'|||_F^2. \tag{51}$$
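The inequality (51) follows from a short expansion, sketched here under the normalizations $\|u\|_2 = \|u'\|_2 = 1$ and $|||\Theta|||_F = |||\Theta'|||_F = 1$ that hold on $S^{N-1} \times \mathcal{R}(t)$. The shorthand $a$ and $b$ is introduced only for this sketch.

```latex
\begin{align*}
a &:= \langle u, u' \rangle, \qquad
b := \langle\langle \Theta, \Theta' \rangle\rangle, \qquad
\langle\langle u \otimes \Theta,\; u' \otimes \Theta' \rangle\rangle = ab, \\
|||u \otimes \Theta - u' \otimes \Theta'|||_F^2 &= 1 + 1 - 2ab = 2 - 2ab, \\
\|u - u'\|_2^2 + |||\Theta - \Theta'|||_F^2 &= (2 - 2a) + (2 - 2b), \\
\big(\|u - u'\|_2^2 + |||\Theta - \Theta'|||_F^2\big)
  - |||u \otimes \Theta - u' \otimes \Theta'|||_F^2
  &= 2 - 2a - 2b + 2ab = 2(1 - a)(1 - b) \;\geq\; 0.
\end{align*}
```

The last step uses $|a| \leq 1$ and $|b| \leq 1$, both of which follow from the Cauchy-Schwarz inequality.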



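Moreover, equality holds whenever $\Theta = \Theta'$. The conditions of the Gordon-Slepian inequality [29] are satisfied, so that we are guaranteed that

$$\mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \|\mathfrak{X}(\Theta)\|_2\Big] = \mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Z_{u,\Theta}\Big] \;\geq\; \mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Y_{u,\Theta}\Big]. \tag{52}$$

We compute

$$\mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \sup_{u \in S^{N-1}} Y_{u,\Theta}\Big] = \mathbb{E}\Big[\sup_{u \in S^{N-1}} \langle g, u\rangle\Big] + \mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \langle\langle G, \Theta\rangle\rangle\Big]$$

$$= \mathbb{E}[\|g\|_2] - \mathbb{E}\Big[\sup_{\Theta \in \mathcal{R}(t)} \langle\langle G, \Theta\rangle\rangle\Big]$$

$$\geq \frac{\sqrt{N}}{2} - t\,\mathbb{E}\big[|||G|||_{\mathrm{op}}\big].$$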



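Since $G \in \mathbb{R}^{k \times p}$ has i.i.d. $N(0, 1)$ entries, standard random matrix theory [16] implies that $\mathbb{E}\big[|||G|||_{\mathrm{op}}\big] \leq \sqrt{k} + \sqrt{p}$. Putting together the pieces, we conclude that

$$\mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \frac{\|\mathfrak{X}(\Theta)\|_2}{\sqrt{N}}\Big] \;\geq\; \frac{1}{2} - \frac{\sqrt{k} + \sqrt{p}}{\sqrt{N}}\, t.$$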
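Two facts drive this step: the duality bound $\langle\langle G, \Theta\rangle\rangle \leq |||G|||_{\mathrm{op}}\,|||\Theta|||_1$ and the expectation bound $\mathbb{E}[|||G|||_{\mathrm{op}}] \leq \sqrt{k} + \sqrt{p}$. The sketch below illustrates both by simulation; the dimensions, trial count, seeds, and helper names (`mean_operator_norm`, `trace_inner_gap`) are arbitrary choices made here for illustration.

```python
import numpy as np

def mean_operator_norm(k, p, trials=200, seed=0):
    """Monte Carlo estimate of E|||G|||_op for a k x p standard Gaussian G."""
    rng = np.random.default_rng(seed)
    norms = [np.linalg.norm(rng.standard_normal((k, p)), ord=2)
             for _ in range(trials)]
    return float(np.mean(norms))

def trace_inner_gap(G, Theta):
    """|||G|||_op * |||Theta|||_1 - <<G, Theta>>, nonnegative by duality."""
    op = np.linalg.norm(G, ord=2)          # largest singular value
    nuc = np.linalg.norm(Theta, ord='nuc') # sum of singular values
    return float(op * nuc - np.sum(G * Theta))

k, p = 30, 20
estimate = mean_operator_norm(k, p)
gordon_bound = float(np.sqrt(k) + np.sqrt(p))

rng = np.random.default_rng(1)
G = rng.standard_normal((k, p))
Theta = rng.standard_normal((k, p))
Theta /= np.linalg.norm(Theta, ord='fro')  # unit Frobenius norm, as in R(t)
gap = trace_inner_gap(G, Theta)
print(estimate, gordon_bound, gap)
```

The simulated mean is only an estimate of the expectation, so it illustrates the bound rather than proving it.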



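Finally, we need to establish sharp concentration around the mean. Note that the function $f(\mathfrak{X}) := \inf_{\Theta \in \mathcal{R}(t)} \|\mathfrak{X}(\Theta)\|_2/\sqrt{N}$ is Lipschitz with constant $1/\sqrt{N}$, so that Lemma 7 implies that

$$\mathbb{P}\Big[\,\Big|\inf_{\Theta \in \mathcal{R}(t)} \frac{\|\mathfrak{X}(\Theta)\|_2}{\sqrt{N}} - \mathbb{E}\Big[\inf_{\Theta \in \mathcal{R}(t)} \frac{\|\mathfrak{X}(\Theta)\|_2}{\sqrt{N}}\Big]\Big| \geq \delta\Big] \;\leq\; 2\exp(-N\delta^2/2) \quad \text{for all } \delta > 0.$$

Setting $\delta = 1/4$ yields the claim.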

E Some useful concentration results 

The following lemma is classical [29, 32], and yields sharp concentration of a Lipschitz function of Gaussian random variables around its mean.

Lemma 7. Let $X \in \mathbb{R}^n$ have i.i.d. $N(0, 1)$ entries, and let $f : \mathbb{R}^n \to \mathbb{R}$ be Lipschitz with constant $L$ (i.e., $|f(x) - f(y)| \leq L\|x - y\|_2$ for all $x, y \in \mathbb{R}^n$). Then for all $t > 0$, we have

$$\mathbb{P}\big[|f(X) - \mathbb{E}f(X)| > t\big] \;\leq\; 2\exp\Big(-\frac{t^2}{2L^2}\Big).$$
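As an illustration of Lemma 7, the sketch below compares an empirical tail frequency with the bound $2\exp(-t^2/(2L^2))$ for the Euclidean norm, which is Lipschitz with constant $L = 1$. The function name `lipschitz_tail`, the sample sizes, and the use of the sample mean in place of $\mathbb{E}f(X)$ are assumptions of this example.

```python
import numpy as np

def lipschitz_tail(f, n, L, t, trials=20000, seed=0):
    """Empirical P[|f(X) - mean| > t] versus the Lemma 7 bound."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((trials, n))
    vals = np.array([f(x) for x in X])
    dev = np.abs(vals - vals.mean())   # sample mean stands in for E f(X)
    empirical = float(np.mean(dev > t))
    bound = float(2.0 * np.exp(-t ** 2 / (2.0 * L ** 2)))
    return empirical, bound

# f(x) = ||x||_2 is 1-Lipschitz with respect to the Euclidean norm.
emp, bnd = lipschitz_tail(np.linalg.norm, n=50, L=1.0, t=2.0)
print(emp, bnd)
```

The bound does not depend on the dimension $n$, which is the feature exploited in the proofs above.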
By exploiting this lemma, we can prove the following result, which yields concentration of the squared $\ell_2$-norm of an arbitrary Gaussian vector:

Lemma 8. Given a Gaussian random vector $Y \sim N(0, Q)$, for all $t > 2/\sqrt{n}$, we have

$$\mathbb{P}\Big[\frac{1}{n}\big|\|Y\|_2^2 - \mathrm{trace}\,Q\big| > 4t\,|||Q|||_{\mathrm{op}}\Big] \;\leq\; 2\exp\Big(-\frac{n(t - 2/\sqrt{n})^2}{2}\Big) + 2\exp(-n/2). \tag{53}$$



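Proof. Let $\sqrt{Q}$ be the symmetric matrix square root, and consider the function $f(x) = \|\sqrt{Q}x\|_2/\sqrt{n}$. Since it is Lipschitz with constant $|||\sqrt{Q}|||_{\mathrm{op}}/\sqrt{n}$, Lemma 7 implies that

$$\mathbb{P}\Big[\big|\|\sqrt{Q}X\|_2 - \mathbb{E}\|\sqrt{Q}X\|_2\big| > \sqrt{n}\,\delta\Big] \;\leq\; 2\exp\Big(-\frac{n\delta^2}{2|||Q|||_{\mathrm{op}}}\Big) \quad \text{for all } \delta > 0. \tag{54}$$

By integrating this tail bound, we find that the variable $Z = \|\sqrt{Q}X\|_2/\sqrt{n}$ satisfies the bound $\mathrm{var}(Z) \leq 4|||Q|||_{\mathrm{op}}/n$, and hence conclude that

$$\Big|\sqrt{\mathbb{E}[Z^2]} - |\mathbb{E}[Z]|\Big| = \Big|\sqrt{\mathrm{trace}(Q)/n} - \mathbb{E}\big[\|\sqrt{Q}X\|_2/\sqrt{n}\big]\Big| \;\leq\; 2\sqrt{\frac{|||Q|||_{\mathrm{op}}}{n}}. \tag{55}$$

Combining this bound with the tail bound (54), we conclude that

$$\mathbb{P}\Big[\frac{1}{\sqrt{n}}\Big|\|\sqrt{Q}X\|_2 - \sqrt{\mathrm{trace}(Q)}\Big| > \delta + 2\sqrt{\frac{|||Q|||_{\mathrm{op}}}{n}}\Big] \;\leq\; 2\exp\Big(-\frac{n\delta^2}{2|||Q|||_{\mathrm{op}}}\Big) \quad \text{for all } \delta > 0. \tag{56}$$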
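The integration step behind the variance bound in this proof can be spelled out as follows. This is a sketch: $s$ is a dummy integration variable introduced here, and the tail bound (54) is used in the rescaled form $\mathbb{P}[|Z - \mathbb{E}Z| > s] \leq 2\exp(-ns^2/(2|||Q|||_{\mathrm{op}}))$.

```latex
\begin{align*}
\mathrm{var}(Z)
  &= \int_0^\infty 2s\, \mathbb{P}\big[|Z - \mathbb{E}Z| > s\big]\, ds
   \;\leq\; \int_0^\infty 4s \exp\Big(-\frac{n s^2}{2|||Q|||_{\mathrm{op}}}\Big)\, ds
   \;=\; \frac{4|||Q|||_{\mathrm{op}}}{n}, \\
\sqrt{\mathbb{E}[Z^2]} - \mathbb{E}[Z]
  &\leq \sqrt{\mathbb{E}[Z^2] - (\mathbb{E}[Z])^2}
   \;=\; \sqrt{\mathrm{var}(Z)}
   \;\leq\; 2\sqrt{\frac{|||Q|||_{\mathrm{op}}}{n}}.
\end{align*}
```

The second line uses $a - b \leq \sqrt{a^2 - b^2}$ for $a \geq b \geq 0$, applied with $a = \sqrt{\mathbb{E}[Z^2]}$ and $b = \mathbb{E}[Z] \geq 0$, together with $\mathbb{E}[Z^2] = \mathrm{trace}(Q)/n$.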



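Setting $\delta = (t - 2/\sqrt{n})\sqrt{|||Q|||_{\mathrm{op}}}$ in the bound (56) yields that

$$\mathbb{P}\Big[\frac{1}{\sqrt{n}}\Big|\|\sqrt{Q}X\|_2 - \sqrt{\mathrm{trace}(Q)}\Big| > t\sqrt{|||Q|||_{\mathrm{op}}}\Big] \;\leq\; 2\exp\Big(-\frac{n(t - 2/\sqrt{n})^2}{2}\Big). \tag{57}$$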



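Similarly, setting $\delta = \sqrt{|||Q|||_{\mathrm{op}}}$ in the tail bound (56) yields that with probability greater than $1 - 2\exp(-n/2)$, we have

$$\Big|\frac{\|Y\|_2}{\sqrt{n}} + \sqrt{\frac{\mathrm{trace}(Q)}{n}}\Big| \;\leq\; \sqrt{\frac{\mathrm{trace}(Q)}{n}} + 3\sqrt{|||Q|||_{\mathrm{op}}} \;\leq\; 4\sqrt{|||Q|||_{\mathrm{op}}}. \tag{58}$$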



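Using these two bounds, we obtain

$$\Big|\frac{\|Y\|_2^2}{n} - \frac{\mathrm{trace}(Q)}{n}\Big| = \Big|\frac{\|Y\|_2}{\sqrt{n}} - \sqrt{\frac{\mathrm{trace}(Q)}{n}}\Big| \cdot \Big|\frac{\|Y\|_2}{\sqrt{n}} + \sqrt{\frac{\mathrm{trace}(Q)}{n}}\Big| \;\leq\; 4t\,|||Q|||_{\mathrm{op}}$$

with the claimed probability.

□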
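A quick simulation can illustrate the statement of Lemma 8. The sketch below assumes the bound (53) in the form $2\exp(-n(t - 2/\sqrt{n})^2/2) + 2\exp(-n/2)$ and uses a Toeplitz covariance $Q_{ij} = 0.5^{|i-j|}$ chosen purely for illustration; the function name `lemma8_check` is hypothetical.

```python
import numpy as np

def lemma8_check(Q, t, trials=5000, seed=0):
    """Empirical frequency of the Lemma 8 event versus the stated bound."""
    n = Q.shape[0]
    rng = np.random.default_rng(seed)
    # Symmetric square root of Q via its eigendecomposition.
    evals, evecs = np.linalg.eigh(Q)
    sqrtQ = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T
    Y = rng.standard_normal((trials, n)) @ sqrtQ   # each row ~ N(0, Q)
    dev = np.abs(np.sum(Y ** 2, axis=1) - np.trace(Q)) / n
    op = float(evals.max())                        # |||Q|||_op
    empirical = float(np.mean(dev > 4.0 * t * op))
    bound = float(2.0 * np.exp(-n * (t - 2.0 / np.sqrt(n)) ** 2 / 2.0)
                  + 2.0 * np.exp(-n / 2.0))
    return empirical, bound

n = 100
idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
Q = 0.5 ** idx                  # illustrative covariance with decaying entries
t = 0.5                         # must satisfy t > 2 / sqrt(n) = 0.2
empirical, bound = lemma8_check(Q, t)
print(empirical, bound)
```

For this choice of $Q$ the event is rare in simulation, which is consistent with the lemma giving a conservative bound rather than a sharp one.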